Preface



The International Workshops on Implementation of Functional Languages (IFL) 
are a tradition that has lasted for over a decade. The aim of these workshops 
is to bring together researchers to discuss new results and new directions of 
research related primarily but not exclusively to the implementation of func- 
tional or function-based languages. A not necessarily exhaustive list of topics 
includes: language concepts, type checking, compilation techniques, (abstract) 
interpretation, automatic program generation, (abstract) machine architectures, 
array processing, concurrent/parallel programming and program execution, heap 
management, runtime profiling and performance measurements, debugging and 
tracing, tools and programming techniques. 

IFL 2000 was held at Schloß Rahe, an 18th century castle in Aachen, Germany,
during the first week of September 2000. It attracted 49 researchers from
the international functional language community, presenting 33 contributions 
during the four days of the workshop. The contributions covered all topics men- 
tioned above. 

In addition, a special session organised by Thomas Arts from the Ericsson 
Computer Science Laboratory in Stockholm, Sweden, attracted several practi- 
tioners from industry who reported on their experiences using the functional 
language Erlang. The Erlang session was sponsored by Ericsson Computer Sci- 
ence Laboratory. 

This year, the workshop was sponsored by local industry (Ericsson Eurolab 
Deutschland GmbH and debis Systemhaus Aachen), which indicates the growing
importance of functional language concepts in commercial spheres. We thank our 
sponsors for their generous contributions. 

With this volume, we follow the lead of the last four IFL workshops in pub- 
lishing a high-quality subset of the contributions presented at the workshop in 
the Springer Lecture Notes in Computer Science series. All speakers attending 
the workshop were invited to submit a paper afterwards. Each of the 33 submis- 
sions was reviewed by three or four PC members and thoroughly discussed by 
the PC. We selected 15 papers for publication in this volume. 

The overall balance of the papers is representative, both in scope and tech- 
nical substance, of the contributions made to the Aachen workshop as well as to 
those that preceded it. Publication in the LNCS series is not only intended to
make these contributions more widely known in the computer science community 
but also to encourage researchers in the field to participate in future workshops, 
of which the next one will be held in Stockholm, Sweden in September 2001 (see 
http://www.ericsson.se/cslab/ifl2001/ for further details). 
Non-determinism Analysis
in a Parallel-Functional Language* 



Ricardo Peña and Clara Segura

Universidad Complutense de Madrid, Spain 
{ricardo,csegura}@sip.ucm.es



Abstract. The paper presents several analyses to detect non-determi- 
nistic expressions in the parallel-functional language Eden. First, the 
need for the analysis is motivated, and then each one is presented. The 
first one is type-based, while the other two are based on abstract
interpretation. Their power and efficiency are discussed, and an example is used
to illustrate the differences. Two interesting functions to adapt abstract
values to types appear, and they happen to form a Galois insertion.



1 Introduction 

The paper presents several analyses to determine when an Eden [BLOP98]
expression is sure to be deterministic, and when it may be non-deterministic.

The parallel-functional language Eden extends the lazy functional language 
Haskell by constructs to explicitly define and communicate processes. The three 
main new concepts are process abstractions, process instantiations and the non- 
deterministic process abstraction merge. Process abstractions of type Process a 
b can be compared to functions of type a -> b, and process instantiations can be 
compared to function applications. An instantiation is achieved by using the
predefined infix operator (#) :: Process a b -> a -> b. Each time an expression
e1 # e2 is evaluated, a new parallel process is created to evaluate (e1 e2).

Non-determinism is introduced in Eden by means of a predefined process 
abstraction merge :: Process [[a]] [a] which fairly interleaves a set of input
lists, to produce a single non-deterministic list. Its implementation immediately 
copies to the output list any value appearing at any of the input lists. So, merge 
can profitably be used to quickly react to requests coming in an unpredictable 
order from a set of processes. This feature is essential in reactive systems and 
very useful in some deterministic parallel algorithms [KPR00]. Eden is aimed at
both types of applications. 

Eden has been implemented by modifying the Glasgow Haskell Compiler 
(GHC) [PHH+93]. GHC translates Haskell into a minimal functional language
called Core where a lot of optimizations [San95, PS98] are performed. Some of
them are incorrect in a non-deterministic environment. So, a non-determinism
analysis is carried out at Core level and, as a result, variables are annotated as

* Work partially supported by the Spanish-British Accion Integrada HB 1999-0102 
and Spanish projects TIC 97-0672 and TIC 2000-0738. 

M. Mohnen and P. Koopman (Eds.): IFL 2000, LNCS 2011, pp. 1-18, 2001.

(c) Springer-Verlag Berlin Heidelberg 2001
deterministic or (possibly) non-deterministic. After that, the dangerous
transformations are disallowed if non-determinism is present.

The plan of the paper is as follows: In Section 2 we review some approaches to
non-determinism in functional languages and in particular in Eden. We further
motivate the need for a non-determinism analysis. Section 3 presents a first
simple analysis by using a type inference system. Section 4 provides a second
view of this analysis as an abstract interpretation with no functional domains.
The analysis is efficient and powerful enough for most purposes, but it loses
precision in function applications and in process instantiations. A non-trivial
example is also given. In Section 5 a new abstract interpretation is developed,
in which functions and process abstractions are interpreted as abstract functions.
The previous example is used to illustrate the differences between both analyses,
the second one being more powerful but less efficient. Section 6 presents some
related work in type-based and abstract-interpretation-based analyses. Finally,
Section 7 concludes and gives some guidelines on future work.

2 Non-determinism in Eden 

The introduction of non-determinism in functional languages has a long tradition 
and has been a source of strong controversy. John McCarthy [McC63] introduced
the operator amb :: a -> a -> a which non-deterministically chooses between
two values. Henderson [Hen82] introduced instead merge :: [a] -> [a] -> [a]
which non-deterministically interleaves two lists into a single list. Both operators
violate referential transparency in the sense that it is no longer possible to replace
equals by equals. For instance, let x = amb 0 1 in x + x ≠ amb 0 1 + amb 0 1
as the first expression may only evaluate to 0 or to 2, while the second one may
also evaluate to 1.
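
The difference between the two expressions can be made concrete by modelling a non-deterministic expression as the list of values it may return. The following Haskell fragment is only a sketch of that set-of-outcomes reading, not of how amb is implemented: here `amb` is a list-valued stand-in for McCarthy's operator, and the names `shared`, `unshared` and `outcomes` are ours.

```haskell
import Data.List (nub, sort)

-- Assumption: a non-deterministic expression is modelled by the list of
-- values it may evaluate to (the list monad enumerates all choices).
amb :: a -> a -> [a]
amb x y = [x, y]

-- let x = amb 0 1 in x + x : one choice, shared by both occurrences of x.
shared :: [Int]
shared = do
  x <- amb 0 1
  return (x + x)

-- amb 0 1 + amb 0 1 : two independent choices.
unshared :: [Int]
unshared = do
  x <- amb 0 1
  y <- amb 0 1
  return (x + y)

-- Possible results as a sorted list without duplicates.
outcomes :: [Int] -> [Int]
outcomes = nub . sort
```

Loaded in GHCi, `outcomes shared` gives `[0,2]` while `outcomes unshared` gives `[0,1,2]`, which is exactly the inequality stated above.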

Hughes and O’Donnell proposed in [HO90] a functional language in which
non-determinism is compatible with referential transparency. The idea is the 
introduction of the type Set a of sets of values to denote the result of non- 
deterministic expressions. The programmer explicitly uses this type whenever 
an expression may return one value chosen from a set of possible values. The 
implementation represents a set by a single value belonging to the set. Once 
a set is created, the programmer cannot come back to single values. So, if a 
deterministic function f is applied to a non-deterministic value (a set s), this
must be expressed as f * s where (*) :: (a -> b) -> Set a -> Set b is the map
function for sets. A limited number of set operations are allowed. The most
important one is ∪ (set union) that allows the creation of non-deterministic sets
and can be used to simulate amb. Others, such as choose :: Set a -> a or ∩ (set
intersection) are disallowed either because they violate referential transparency
or because they cannot be correctly implemented by ‘remembering’ one value 
per set. In the paper, a denotational semantics based on Hoare powerdomains 
is given for the language and a number of useful equational laws are presented 
so that the programmer can formally reason about the (partial) correctness of 
programs. 
But the controversy goes further. In [SS90, SS92] the authors claim that
what is really missing is an appropriate definition of referential transparency.
They show that several apparently equivalent definitions (replacing equals by 
equals, unfoldability of definitions, absence of side effects, definiteness of vari- 
ables, determinism, and others) have been around in different contexts and that 
they are not in fact equivalent in the presence of non-determinism. To situate 
Eden in perspective, we reproduce here their main concepts: 

Referential transparency. Expression $e$ is purely referential in position $p$ iff
$\forall e_1, e_2.\ [\![e_1]\!]\,\rho = [\![e_2]\!]\,\rho \Rightarrow [\![e[e_1/p]]\!]\,\rho = [\![e[e_2/p]]\!]\,\rho$.
Operator $op :: t_1 \to \cdots \to t_n \to t$ is referentially transparent if for all
expressions $e = op\ e_1 \cdots e_n$, whenever expression $e_i$, $1 \le i \le n$, is purely
referential in position $p$, expression $e$ is purely referential in position $i.p$.
A language is referentially transparent if all of its operators are.

Definiteness. Definiteness property holds if a variable denotes the same single
value in all its occurrences. For instance, if variables are definite, the
expression $(\lambda x.\,x - x)(amb\ 0\ 1)$ always evaluates to 0. If they are not, it may
also evaluate to 1 and −1.

Unfoldability. Unfoldability property holds if
$[\![(\lambda x.e)\ e']\!]\,\rho = [\![e[e'/x]]\!]\,\rho$ for all $e$, $e'$. In the presence of
non-determinism, unfoldability is not compatible with definiteness. For instance,
if variables are definite,
$[\![(\lambda x.\,x - x)(amb\ 0\ 1)]\!]\,\rho \neq [\![(amb\ 0\ 1) - (amb\ 0\ 1)]\!]\,\rho$.

In the above definitions, the semantics of an expression is a set of values in the 
appropriate powerdomain. However, the environment $\rho$ maps a variable into a
single value in the case variables are definite (also called singular semantics), and 
to a set of values in the case they are indefinite (also called plural semantics). 

In Eden, the only source of non-determinism is the predefined process merge. 
When instantiating a new process by evaluating the expression e1 # e2, closure
e1, together with the closures of all the free variables referenced there, is copied
(possibly unevaluated) to another processor where the new process is instanti- 
ated. However, within the same processor, a variable is evaluated at most once 
and its value is shared thereafter. We are still developing a denotational seman- 
tics for the language but, for the purpose of this discussion, we will assume that 
the denotation of an expression of type a is a (downwards and limit closed) 
set of values of type a representing the set of possible values returned by the 
expression. If the expression is deterministic, its denotation is a singleton. 

Under these premises, we can characterize Eden as referentially transparent. 
The only difference with respect to Haskell is that now, in a given environment 
$\rho$, an expression denotes a set of values instead of a single one. Inside an
expression, a non-deterministic subexpression can always be replaced by its denotation
without affecting the resulting set of values. 

Variables are definite within the same process and are not definite within 
different processes. When an unevaluated non-deterministic free variable is du- 
plicated in two different processes, it may happen that the actual value computed 
by each process is different. However, denotationally both variables represent the 
same set of values, so the semantics of the enclosing expressions will not change
by the fact that the variable is evaluated twice. 

In general, in Eden we do not have the unfoldability property except in 
the case that the unfolded expression is deterministic. This is a consequence of 
having definite variables within a process. The motivation for a non-determinism 
analysis in Eden comes from the following two facts: 

- In the future, Eden programmers may wish to have definite variables in all
  situations. It is sensible to think of having a compiler flag to select this
  semantic option. In this case, the analysis will detect the (possibly)
  non-deterministic variables and the compiler will force their evaluation to
  normal form before being copied to a different processor.

- At present, some transformations carried out by the compiler in the
  optimization phases are semantically incorrect for non-deterministic expressions.
  The most important one is full laziness [PPS96]. The compiler will detect
  the non-deterministic bindings and disallow the floating of a let out of a
  lambda in these situations. Other dangerous transformations are the static
  argument transformation [San95] and the specialization (see [PPRS00] for
  more details). The general reason for all of them is the increase in closure
  sharing: Before the transformation, several evaluations of a non-deterministic
  expression can produce several different values; after the transformation,
  a shared non-deterministic expression is evaluated once, yielding a unique
  value.

To justify the last item, let us consider the following two bindings: 

    f = λy. let x = e1          f' = let x = e1
            in x + y                 in λy. x + y

If e1 is non-deterministic, the semantics of f is a non-deterministic function.
Then, $[\![f\ 5 - f\ 5]\!]\,\rho$ will deliver a non-singleton set of values as x is
evaluated each time f is applied. The semantics of the expression bound to f' is
instead a set of deterministic functions and, due to the definiteness of variables
x and f', $[\![f'\ 5 - f'\ 5]\!]\,\rho$ always evaluates to {0}. So, the semantics has changed.
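
The effect of floating the let out of the lambda can be replayed with a small model in which a non-deterministic expression is the list of values it may produce. This is a sketch under stated assumptions, not Eden code: `e1` is taken to be an expression that may yield 0 or 1, and `diffBefore` and `diffAfter` are names introduced here for the two subtractions discussed above.

```haskell
import Data.List (nub, sort)

-- Assumption: e1 is some non-deterministic expression; we pick one that may
-- evaluate to 0 or 1 and model it by its list of possible values.
e1 :: [Int]
e1 = [0, 1]

-- f = \y -> let x = e1 in x + y : x is chosen anew at every application.
f :: Int -> [Int]
f y = do
  x <- e1
  return (x + y)

-- f' = let x = e1 in \y -> x + y : x is chosen once, giving a set of
-- deterministic functions.
f' :: [Int -> Int]
f' = do
  x <- e1
  return (\y -> x + y)

-- f 5 - f 5 : the two applications choose independently.
diffBefore :: [Int]
diffBefore = nub . sort $ do
  a <- f 5
  b <- f 5
  return (a - b)

-- f' 5 - f' 5 : f' is definite, so both applications use the same function.
diffAfter :: [Int]
diffAfter = nub . sort $ do
  g <- f'
  return (g 5 - g 5)
```

Here `diffBefore` is `[-1,0,1]` and `diffAfter` is `[0]`: the transformation has shrunk the set of possible results, which is why it must be disallowed for non-deterministic bindings.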

3 A Type-Based Analysis 

The Language. As Eden is implemented by modifying GHC, the language 
being analysed is an extension of Core. This is a simple functional language with 
second-order polymorphism, so the language includes type abstraction and type 
application. In Figure 1 the syntax of the language and of the type expressions
is shown. There, v denotes a variable, k denotes a literal and x denotes an atom 
(a variable or a literal). A program is a list of possibly recursive bindings from 
variables to expressions. Such expressions include variables, lambda abstractions, 
applications of a functional expression to an atom, constructor applications, 
primitive operators applications, and also case and let expressions. Constructor 
    prog → bind_1 ; ... ; bind_m
    bind → v = expr                                  {non-recursive binding}
         | rec v_1 = expr_1 ; ... ; v_m = expr_m     {recursive binding}
    expr → expr x                                    {application to an atom}
         | λv.expr                                   {lambda abstraction}
         | case expr of alts                         {case expression}
         | let bind in expr                          {let expression}
         | C x_1 ... x_m                             {saturated constructor application}
         | op x_1 ... x_m                            {saturated primitive operator application}
         | x                                         {atom}
         | Λα.expr                                   {type abstraction}
         | expr type                                 {type application}
         | v # x                                     {process instantiation}
         | process v → expr                          {process abstraction}
    alts → Calt_1 ; ... ; Calt_m ; [Deft]            m ≥ 0
         | Lalt_1 ; ... ; Lalt_m ; [Deft]            m ≥ 0
    Calt → C v_1 ... v_m → expr                      m ≥ 0 {algebraic alternative}
    Lalt → k → expr                                  {primitive alternative}
    Deft → v → expr                                  {default alternative}
    type → K                                         {basic types: integers, characters}
         | α                                         {type variables}
         | T type_1 ... type_m                       {type constructor application}
         | type_1 → type_2                           {function type}
         | Process type_1 type_2                     {process type}
         | ∀α.type                                   {polymorphic type}

Fig. 1. Language definition and type expressions



and primitive operators applications are saturated. The variables contain type 
information, so we will not write it explicitly in the expressions. 

The new Eden expressions are a process abstraction process v → e, and a
process instantiation v # x. There is also a new type Process t1 t2 (see Figure 1),
representing the type of a process abstraction process v → e where v has type
t1 and e has type t2. Frequently t1 and t2 are tuple types, where each tuple
element represents an input or an output channel of the process.

The Annotations. As type information is already available in the language, we 
just need to annotate types. The analysis attaches non-determinism annotations 
to types. These are basic annotations n or d, or tuples of basic annotations. A 
basic annotation d in the type of an expression means that such expression is 
sure to be deterministic. For example, an integer constant is deterministic. A 
basic annotation n means that the expression may be non-deterministic. 

Tuples of basic annotations correspond to expressions of tuple type (or pro- 
cesses/functions returning tuples, see below) where each component carries its 
own annotation. The tuple type is treated in a special way; the rest of data 
types just carry a basic annotation. Processes usually have several input/output 
$$a \to b \;\mid\; (b_1, \ldots, b_m)$$

$$(b_1, \ldots, b_m)_t = (b_1, \ldots, b_m)$$
$$b_{(t_1, \ldots, t_m)} = (b, \ldots, b)$$
$$b_{t_1 \to t_2} = b_{Process\ t_1\ t_2} = b_{t_2}$$
$$b_{\forall\alpha.t} = b_t$$
$$b_t = b \quad \text{if } t = K,\ \alpha,\ T\ t_1 \ldots t_m$$

Fig. 2. Annotations and adaptation function definition



channels, and this fact is represented by using tuples. In the implementation, an 
independent concurrent thread is provided for every output channel of a process. 
We would like to express which ones are deterministic and which ones may be 
non-deterministic. For example, in the following process abstraction 

    process v → case v of (v1,v2) → let y1 = v1 in
                                    let y2 = merge # v2 in (y1,y2)

we say that the first output is deterministic and that the second one may be non- 
deterministic. The same happens to functions returning tuples. As the internal 
tuples do not represent output channels only one level of tupling is maintained. 

Some Notation. In the following, b is used to denote a basic annotation and
a to denote a basic annotation or a tuple of basic annotations, see Figure 2.
Regarding the types, t is used to denote the unannotated ones, see Figure 1,
and $\tau$ or $t^a$ to denote the annotated ones. In the type environments,
$A + [v :: t^a]$ denotes the extension of environment A with the annotated typing
for v. In the typing rules of Figure 3, i ranges from 1 to m and j from 1 to $l_i$.
Overlining is used to indicate an indexed sequence. For example,
$A + \overline{[v_i :: \tau_i]}$ represents the extension of A with new typings for the
variables $v_1, \ldots, v_m$.

An ordering between the annotations is established, $d \sqsubseteq n$ (naturally
extended to tuples). Several least upper bound (lub) operators can be defined, as
well as an operator to flatten the internal tuples so that nested tuples do not
appear; all of them are shown in Figure 3. In the rules shown there it is necessary
to adapt an annotation a of type t' to a type t in some places. This adaptation
function, see Figure 2, is represented as $a_t$: If a is basic and t is a tuple type,
it replicates a to construct a tuple of the corresponding size; if it is already a
tuple, it behaves as the identity.
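
The ordering, the lub operators, the flattening operator and the adaptation function are small enough to write down directly. The Haskell fragment below is a sketch of one possible encoding and not the authors' implementation: the data types `Basic`, `Ann` and `Type` and the function names `lub`, `flatten` and `adapt` are our own choices, and the type syntax is cut down to what the adaptation function inspects.

```haskell
-- Basic annotations; the derived Ord gives D < N, i.e. d below n.
data Basic = D | N deriving (Eq, Ord, Show)

-- An annotation is basic or a (non-nested) tuple of basic annotations.
data Ann = B Basic | Tup [Basic] deriving (Eq, Show)

-- Simplified, hypothetical representation of type expressions.
data Type
  = K                     -- basic type
  | TVar String           -- type variable
  | TCon String [Type]    -- algebraic type constructor application
  | TTuple [Type]         -- tuple type
  | Fun Type Type         -- t1 -> t2
  | Proc Type Type        -- Process t1 t2
  | Forall String Type    -- polymorphic type

-- Binary lub, extended to tuples componentwise.
lub :: Ann -> Ann -> Ann
lub (B b)    (B b')    = B (max b b')
lub (B b)    (Tup bs)  = Tup (map (max b) bs)
lub (Tup bs) (B b)     = Tup (map (max b) bs)
lub (Tup bs) (Tup bs') = Tup (zipWith max bs bs')

-- Flattening: collapse an annotation to a single basic one.
flatten :: Ann -> Basic
flatten (B b)    = b
flatten (Tup bs) = foldr max D bs

-- Adaptation of an annotation to a type.
adapt :: Ann -> Type -> Ann
adapt (Tup bs) _            = Tup bs                  -- tuples are kept as they are
adapt (B b)    (TTuple ts)  = Tup (map (const b) ts)  -- replicate on tuple types
adapt a        (Fun _ t)    = adapt a t               -- look at the result type
adapt a        (Proc _ t)   = adapt a t
adapt a        (Forall _ t) = adapt a t
adapt a        _            = a                       -- K, type variables, T t1 ... tm
```

For example, `adapt (B D) (Proc (TTuple [K, K]) (TTuple [K, K]))` yields `Tup [D,D]`, and `flatten (Tup [D,N])` yields `N`.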

The Type System. In Figure 3 the type system is shown. Rule [VAR] is trivial.
Rule [LIT] specifies that constants of basic types are deterministic. There are
two rules for constructors: One for tuples [TUP] and another one [CONS] for the
rest. In the first case, we obtain the annotation of each component, flatten them
(if they are tuples, nesting must be eliminated) and give back the resulting tuple.
In the second case, we also obtain the components’ annotations, flatten them
and finally apply the lub operator, so that a basic annotation is obtained. This
implies that, if any component of the construction may be non-deterministic,
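$$\frac{}{A + [v :: t^a] \vdash v :: t^a}\ [VAR] \qquad \frac{}{A \vdash k :: K^d}\ [LIT]$$

$$\frac{\mathbf{data}\ T\ \overline{\alpha_k} = \overline{C_i\ \overline{t_{ij}}} \qquad \overline{A \vdash x_j :: (t_{ij}[\overline{\alpha_k := t_k}])^{a_j}}}{A \vdash C_i\ \overline{x_j} :: (T\ \overline{t_k})^{\bigsqcup_j (\sqcup a_j)}}\ [CONS]$$

$$\frac{\overline{A \vdash x_i :: t_i^{a_i}}}{A \vdash (x_1, \ldots, x_m) :: (t_1, \ldots, t_m)^{(\sqcup a_1, \ldots, \sqcup a_m)}}\ [TUP]$$

$$\frac{op :: t_1 \to (t_2 \to \ldots (t_m \to t)) \qquad \overline{A \vdash x_i :: t_i^{a_i}}}{A \vdash op\ x_1 \ldots x_m :: t^{(\bigsqcup_i (\sqcup a_i))_t}}\ [PRIM]$$

$$\frac{A \vdash e :: (t_1 \to t_2)^{a_1} \qquad A \vdash x :: t_1^{a_2}}{A \vdash (e\ x) :: t_2^{(\sqcup a_2) \sqcup a_1}}\ [APPLY]$$

$$\frac{A + [v :: t^{d_t}] \vdash e :: t'^{a}}{A \vdash (\lambda v.e) :: (t \to t')^{a}}\ [ABS] \qquad \frac{A + [v :: t^{d_t}] \vdash e :: t'^{a}}{A \vdash \mathbf{process}\ v \to e :: (Process\ t\ t')^{a}}\ [PABS]$$

$$\frac{A \vdash e_1 :: \tau_1 \qquad A + [v :: \tau_1] \vdash e :: \tau_2}{A \vdash \mathbf{let}\ v = e_1\ \mathbf{in}\ e :: \tau_2}\ [LETNONREC]$$

$$\frac{\overline{A + \overline{[v_i :: \tau_i]} \vdash e_i :: \tau_i} \qquad A + \overline{[v_i :: \tau_i]} \vdash e :: \tau}{A \vdash \mathbf{let\ rec}\ \overline{v_i = e_i}\ \mathbf{in}\ e :: \tau}\ [LETREC]$$

$$\frac{A \vdash p :: (Process\ t\ t')^{a} \qquad A \vdash x :: t^{a'}}{A \vdash p \# x :: t'^{a \sqcup b}}\ \text{where } b = \sqcup a' \quad [PINST] \qquad \frac{}{A \vdash merge :: (Process\ [[\alpha]]\ [\alpha])^{n}}\ [MERGE]$$

$$\frac{A \vdash e :: (t_1, \ldots, t_m)^{(b_1, \ldots, b_m)} \qquad A + \overline{[v_i :: t_i^{(b_i)_{t_i}}]} \vdash e' :: t'^{a}}{A \vdash \mathbf{case}\ e\ \mathbf{of}\ (v_1, \ldots, v_m) \to e' :: t'^{a}}\ [CASETUP]$$

$$\frac{\mathbf{data}\ T\ \overline{\alpha_k} = \overline{C_i\ \overline{t_{ij}}} \qquad A \vdash e :: (T\ \overline{t_k})^{b} \qquad \overline{A + A_i \vdash e_i :: t^{a_i}}}{A \vdash \mathbf{case}\ e\ \mathbf{of}\ \overline{C_i\ \overline{v_{ij}} \to e_i} :: t^{b \sqcup (\bigsqcup_i a_i)}}\ [CASEALG]$$

where $A_i = \overline{[v_{ij} :: tv_{ij}^{\,b_{tv_{ij}}}]}$ and $tv_{ij} = t_{ij}[\overline{\alpha_k := t_k}]$

$$\frac{A \vdash e :: K^{b} \qquad \overline{A \vdash e_i :: t^{a_i}}}{A \vdash \mathbf{case}\ e\ \mathbf{of}\ \overline{k_i \to e_i} :: t^{b \sqcup (\bigsqcup_i a_i)}}\ [CASEPRIM]$$

$$\frac{A, \alpha \vdash e :: t^{a}}{A \vdash \Lambda\alpha.e :: (\forall\alpha.t)^{a}}\ [TYABS] \qquad \frac{A \vdash e :: (\forall\alpha.t)^{a}}{A \vdash (e\ t') :: tinst^{\,a_{tinst}}}\ \text{where } tinst = t[\alpha := t'] \quad [TYAPP]$$

$$n \sqcup b = n \qquad d \sqcup b = b$$
$$b \sqcup (b_1, \ldots, b_m) = (b_1 \sqcup b, \ldots, b_m \sqcup b)$$
$$(b_1, \ldots, b_m) \sqcup (b'_1, \ldots, b'_m) = (b_1 \sqcup b'_1, \ldots, b_m \sqcup b'_m)$$
$$\sqcup\, b = b \qquad \sqcup\,(b_1, \ldots, b_m) = \bigsqcup_i b_i$$

Fig. 3. Types annotation system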
the whole expression will be considered as possibly non-deterministic; the
information about the components is lost.

Rules [ABS] and [PABS] need some explanation. What is a deterministic
function/process? As the merge process is the only source of non-determinism, we
will say that an expression may be non-deterministic when it ‘contains’ any
instantiation of this process. So we will consider that a function/process is
deterministic if it does not generate non-deterministic results from deterministic
arguments. This means that we are only interested in the result of the function
when it is applied to a deterministic argument. So, in the rule [ABS] the
annotation attached to the function is the one obtained for the body when in the
environment the argument is assigned a deterministic annotation. If the body
gets a deterministic annotation, the function is deterministic; but if the body may
be non-deterministic then the function may be non-deterministic. The
deterministic annotation given to the argument is an adaptation of the basic
annotation d to the type of the argument, see Figure 2. For example, if it is an
n-tuple, the annotation should be an n-tuple (d, . . . , d). If the argument is
non-deterministic, we always assume that the result may be non-deterministic.
This means that we are not expressing how the output depends on the input. In
Section 5 we will see that this leads to some limitations of the analysis. The
lack of such information is reflected in the [APPLY] rule. The result of the
application may be non-deterministic if either the function is annotated as
non-deterministic or the argument is annotated as non-deterministic. This is
expressed by using a lub operator. If the argument’s annotation is a tuple, then
we have to flatten it first, as we cannot use the information that its components
provide. Such information (independent annotations) is used when the components
are separately used in different parts of the program, and this is what usually
happens with processes: Each output channel feeds a different process.

Rules [PABS] and [PINST] are analogous to [ABS] and [APPLY]. In the [PRIM]
rule, primitive operators are considered as deterministic, so we just flatten the
annotations of the arguments and apply a lub operator to them. Finally, the 
annotation is adapted to the type of the result. The [MERGE] rule specifies 
that merge may be a non-deterministic process (in fact, it is the source of non- 
determinism). The [LETNONREC] and [LETREC] rules are the expected ones: 
The binders are added to the environment with the annotations of the right 
hand sides of their bindings. 

An algebraic case expression may be non-deterministic if either the discrimi- 
nant expression (the choice between the alternatives could be non-deterministic) 
or any of the expressions in the alternatives may be non-deterministic. This 
is expressed in the [CASEALG] rule. However if the discriminant is a tuple, 
there is no non-deterministic choice between the alternatives. This informa- 
tion is just passed to the right hand side of the alternative, so that only if 
the non-deterministic variables are used there, the result will be annotated as 
non-deterministic. This is reflected in the [CASETUP] rule. In general, the same
applies to those types with only one constructor, so it could be extended to all 
such types. In these two case rules, the annotation obtained from the discrimi- 
$$Basic = \{d, n\} \quad \text{where } d \sqsubseteq n$$
$$D_K = D_\alpha = D_{T\ t_1 \ldots t_m} = Basic$$
$$D_{(t_1, \ldots, t_m)} = \{(b_1, \ldots, b_m) \mid b_i \in Basic\}$$
$$D_{t_1 \to t_2} = D_{Process\ t_1\ t_2} = D_{t_2}$$
$$D_{\forall\alpha.t} = D_t$$

Fig. 4. Abstract domains

nant has to be adapted to the types of the variables in the left hand side of the 
alternatives. In the [CASEALG] rule the discriminant annotation is just a basic 
annotation that represents the whole structure. If it is deterministic, then we can 
say that each of the components of the value is deterministic; and in case it is 
non-deterministic, we have to say that each component is non-deterministic, as 
we have lost information when annotating the discriminant. In the [CASETUP] 
rule each component has its own annotation, so we don’t lose so much informa- 
tion, but, as there are no nested tuples, we still have to adapt each annotation 
to the component’s type. The optional default alternative has not been included 
in the figure for clarity but it is easy to do. 

We have type polymorphism but not annotation polymorphism. In the [TYABS]
rule, $A, \alpha$ means that $\alpha$ is a type variable not free in A. When the
instantiation of a polymorphic type takes place, it is necessary to adapt the
annotation of the polymorphic type to the instantiated type. This is necessary
when new structure arises from the instantiation. For example, if we apply the
identity process $\Lambda\alpha.\mathbf{process}\ v \to v$, with annotated type
$(\forall\alpha.Process\ \alpha\ \alpha)^d$, to (Int, Int) we need to adapt d to
Process (Int, Int) (Int, Int), which produces (d, d), see Figure 2. If the
external structure was already a tuple, the annotation is maintained. For
example, in $\Lambda\alpha.\mathbf{process}\ v \to (v, v)$, with annotated type
$(\forall\alpha.Process\ \alpha\ (\alpha, \alpha))^{(d,d)}$,
the adaptation gives back the same annotation (d, d).

4 Abstract Interpretation 

The analysis of the previous section has several limitations, explained in
Section 5. In this section an abstract interpretation version of the analysis is
presented. This version will lead us to develop a more powerful analysis, also
abstract interpretation based, in which we will be able to overcome these
limitations. Such extension does not seem so evident in the type annotation system.

The Abstract Domains. The type system is directly related to an abstract 
interpretation where the domains corresponding to functions/processes are iden- 
tified with their range domains. This means there are no functional domains, so 
the fixpoint calculation is less expensive. Figure 4 shows the abstract domains.

There is a basic domain Basic that corresponds to the annotations d and 
n in the previous section, with the same ordering. This is the abstract domain 
corresponding to basic types and algebraic types (except tuples). Tuples are 
again specially treated, as tuples of basic abstract values. The abstract domain 
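$$[\![v]\!]\,\rho = \rho(v)$$
$$[\![k]\!]\,\rho = d$$
$$[\![(x_1, \ldots, x_m)]\!]\,\rho = (\sqcup([\![x_1]\!]\,\rho), \ldots, \sqcup([\![x_m]\!]\,\rho))$$
$$[\![C\ x_1 \ldots x_m]\!]\,\rho = \bigsqcup_i \sqcup([\![x_i]\!]\,\rho)$$
$$[\![e\ x]\!]\,\rho = (\sqcup([\![x]\!]\,\rho)) \sqcup [\![e]\!]\,\rho$$
$$[\![op\ x_1 \ldots x_m]\!]\,\rho = (\bigsqcup_i \sqcup([\![x_i]\!]\,\rho))_t \quad \text{where } op :: t_1 \to (t_2 \to \ldots (t_m \to t))$$
$$[\![p \# x]\!]\,\rho = \sqcup([\![x]\!]\,\rho) \sqcup [\![p]\!]\,\rho$$
$$[\![\lambda v.e]\!]\,\rho = [\![e]\!]\,\rho\,[v \mapsto d_t] \quad \text{where } v :: t$$
$$[\![\mathbf{process}\ v \to e]\!]\,\rho = [\![e]\!]\,\rho\,[v \mapsto d_t] \quad \text{where } v :: t$$
$$[\![merge]\!]\,\rho = n$$
$$[\![\mathbf{let}\ v = e\ \mathbf{in}\ e']\!]\,\rho = [\![e']\!]\,\rho\,[v \mapsto [\![e]\!]\,\rho]$$
$$[\![\mathbf{let\ rec}\ \{\overline{v_i = e_i}\}\ \mathbf{in}\ e']\!]\,\rho = [\![e']\!]\,(fix\ (\lambda\rho'.\,\rho\,\overline{[v_i \mapsto [\![e_i]\!]\,\rho']}))$$
$$[\![\mathbf{case}\ e\ \mathbf{of}\ (v_1, \ldots, v_m) \to e']\!]\,\rho = [\![e']\!]\,\rho\,\overline{[v_i \mapsto (\pi_i([\![e]\!]\,\rho))_{t_i}]} \quad \text{where } v_i :: t_i$$
$$[\![\mathbf{case}\ e\ \mathbf{of}\ \overline{C_i\ \overline{v_{ij}} \to e_i}]\!]\,\rho = b \sqcup (\bigsqcup_i [\![e_i]\!]\,\rho_i) \quad \text{where } b = [\![e]\!]\,\rho,\ \rho_i = \rho\,\overline{[v_{ij} \mapsto b_{t_{ij}}]},\ v_{ij} :: t_{ij}$$
$$[\![\mathbf{case}\ e\ \mathbf{of}\ \overline{k_i \to e_i}]\!]\,\rho = [\![e]\!]\,\rho \sqcup (\bigsqcup_i [\![e_i]\!]\,\rho)$$
$$[\![\Lambda\alpha.e]\!]\,\rho = [\![e]\!]\,\rho$$
$$[\![e\ t]\!]\,\rho = ([\![e]\!]\,\rho)_{tinst} \quad \text{where } (e\ t) :: tinst$$

Fig. 5. Abstract interpretation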

corresponding to functions and processes is the abstract domain corresponding 
to the type of the result. The abstract domain of a polymorphic type is that of 
its smallest instance, i.e. that one in which K is substituted for the type variable. 
So the domain corresponding to a type variable is Basic. 
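
Because functions and processes are abstracted by the domain of their result, an interpreter over these domains is a plain structural recursion. The fragment below is a minimal sketch for a tuple-free subset of the language, so every abstract value is a basic annotation. The `Expr` type, the function `absVal` and the decision to treat unbound variables as possibly non-deterministic are our assumptions, not part of the paper.

```haskell
import Data.Maybe (fromMaybe)

-- Basic abstract values with d below n (derived Ord: D < N).
data Basic = D | N deriving (Eq, Ord, Show)

-- Hypothetical, tuple-free fragment of the analysed language.
data Expr
  = Var String
  | Lit Int
  | Lam String Expr       -- \v.e
  | App Expr Expr         -- e x (the restriction to atoms is not enforced)
  | Process String Expr   -- process v -> e
  | Inst Expr Expr        -- p # x
  | Merge                 -- the predefined merge process
  | Let String Expr Expr  -- non-recursive let

type Env = [(String, Basic)]

absVal :: Env -> Expr -> Basic
absVal env (Var v)       = fromMaybe N (lookup v env)  -- unbound: assume n (safe)
absVal _   (Lit _)       = D
absVal env (Lam v e)     = absVal ((v, D) : env) e     -- argument assumed d
absVal env (Process v e) = absVal ((v, D) : env) e
absVal env (App e x)     = max (absVal env x) (absVal env e)
absVal env (Inst p x)    = max (absVal env x) (absVal env p)
absVal _   Merge         = N
absVal env (Let v e e')  = absVal ((v, absVal env e) : env) e'

-- A constant function applied to a non-deterministic argument.
constApplied :: Expr
constApplied = App (Lam "v" (Lit 0)) (Inst Merge (Var "xs"))
```

With `xs` bound to `D`, `absVal [("xs", D)] constApplied` is `N` although the function ignores its argument: the lub in the application equation cannot express that the result does not depend on the input.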

The Abstract Interpretation. In Figure 5 the abstract interpretation is
shown. It is very similar to the type annotation system, so we just outline some
specific details. In the recursive let we have to calculate a fixpoint, which can be
obtained by using Kleene’s ascending chain:

$$[\![\mathbf{let\ rec}\ \{\overline{v_i = e_i}\}\ \mathbf{in}\ e']\!]\,\rho = [\![e']\!]\,\Big(\bigsqcup_{n \in \mathbb{N}} (\lambda\rho'.\,\rho\,\overline{[v_i \mapsto [\![e_i]\!]\,\rho']})^n(\rho_0)\Big)$$

where $\rho_0$ is an environment in which all variables have as abstract value the
infimum of its corresponding abstract domain. At each iteration, the abstract 
values of bindings’ right hand sides are computed and the environment is updated 
until no changes are found. Termination is assured, as the abstract domains 
corresponding to each type are finite. The number of iterations is O(N), where
N is the total number of ‘components’ in the bindings. We consider a non- 
tuple variable as a single component, and a tuple variable v :: (ti, . . . ,tm) as m 
components. 
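
The iteration just described can be sketched in a few lines. The code below is an illustration under simplifying assumptions rather than the implemented algorithm: abstract values are restricted to basic annotations, each binding is given directly as a function from the environment to an abstract value (assumed monotone), and the names `Rhs`, `valueOf`, `kleene` and `example` are ours.

```haskell
import Data.Maybe (fromMaybe)

data Basic = D | N deriving (Eq, Ord, Show)

type Env = [(String, Basic)]

-- Abstract right-hand side of a binding: a function of the environment.
-- Assumption: every such function is monotone with respect to D <= N.
type Rhs = Env -> Basic

valueOf :: Env -> String -> Basic
valueOf env v = fromMaybe D (lookup v env)

-- Ascending chain rho0, F rho0, F (F rho0), ... until nothing changes.
-- rho0 maps every bound variable to the infimum D.
kleene :: [(String, Rhs)] -> Env
kleene binds = go [(v, D) | (v, _) <- binds]
  where
    go env
      | env' == env = env
      | otherwise   = go env'
      where
        env' = [(v, rhs env) | (v, rhs) <- binds]

-- Hypothetical mutually recursive bindings:
--   xs = merge # ys   (always possibly non-deterministic)
--   ys = f xs         (as non-deterministic as xs, f deterministic)
--   zs = 0            (deterministic)
example :: [(String, Rhs)]
example =
  [ ("xs", \env -> max N (valueOf env "ys"))
  , ("ys", \env -> valueOf env "xs")
  , ("zs", const D)
  ]
```

On `example` the chain stabilises after two updates with `xs` and `ys` at `N` and `zs` at `D`; since each component can only move from `D` to `N` once, the number of updates is bounded by the number of components.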

From the abstract interpretation we have obtained an algorithm that annotates
each subexpression with its abstract value. This algorithm has been implemented
in Haskell, and we have executed it with some examples, one of which is
shown below. This algorithm uses syntax-driven recursive calls that accumulate
variables in the environment as necessary. This is equivalent to a bottom-to-top
pass in the type annotation rules where only the type environments are built.
When recursive calls finish, lub operations are carried out. This is equivalent to
the application of the types annotation rules from top to bottom, once we have
the appropriate environment.

    rw = λ worker. λ ts.
      let rec
        t  = (ts, is)
        ys = manager # t
        y1 = case ys of (z1,z2,z3) …
        y2 = case ys of (u1,u2,u3) …
        y3 = case ys of (v1,v2,v3) …
        o1 = worker # y1
        o2 = worker # y2
        os = [o1, o2]
        is = merge # os
      in y3

    manager = process ts → case ts of
                (t1, t2) → (g t1 t2, h …

Fig. 6. Replicated workers process structure where n = 2

An Example: Replicated Workers. This example shows a simplified version 
of a replicated workers topology [KPR00]. We have a manager process and n
worker processes. The manager provides the workers with tasks. When any of
the workers finishes its task, it sends a message to the manager including the
obtained results and asking for a new task. In order to enable the manager to
assign new tasks to idle processes immediately, even though the answers may be
received from the workers in any order, a merge process is needed. The function
rw representing this scheme when n = 2 is shown in Figure 6, where worker is
the worker process and ts is an initial list of tasks to be done by the workers.
The output of the manager process manager usually depends on both input lists,
the initial one ts and that produced by the workers os. However, in order to
compare the power of this analysis with the one presented in Section 5, we
are assuming that manager is defined as shown in Figure 6, where g, h and r are
deterministic functions (i.e. they have as abstract value d). So, manager’s abstract
value is (d, d, d). With this definition, the third component of the process output
only depends on the first input list. This means that the final result of the
function rw only depends on the initial tasks list. In Figure 6 the process topology
with the annotations in each channel is shown. As there are mutually recursive 
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definitions, all of them get the n annotation. However, we know that that when 
ts is deterministic, the result of the function is also deterministic, as it only 
depends on that initial list, and not on the one coming from the merge process. 
The analysis answer is safe but just approximate. In Section | 5 ] a more powerful 
analysis is presented. There the results are more accurate. 

In real applications of this scheme, although the manager receives the
results from the workers in any order, it sorts them and the output of the whole
structure is deterministic. However, this cannot be detected by the analysis,
and the result still gets an n annotation.

5 Limitations and Refinement 

Limitations. As we have previously said, there are no functional abstract
domains in this analysis. This means that the fixpoint calculation is not expensive,
but it also imposes some limitations on the analysis. For instance, we cannot
express the dependency of the result on the argument. As we have
said before, in a function application or process instantiation, we cannot fully
use the information provided by the argument.

This happens, for example, when the function does not depend on any of
its arguments. For example, if we define the function f v = 5, the analysis tells
us that the function is deterministic, but when we apply f to a possibly
non-deterministic value, the result of the application is established as possibly
non-deterministic. This is not true, as we will always obtain the same unique value.
Of course this is a safe approximation, but not a very accurate one. The function
g v = v would have the same abstract behaviour. Both f and g are deterministic
functions, but they have different levels of determinism: f does not depend on its
argument, but g does. If functions and processes were interpreted as functions,
this limitation would disappear. The abstract function corresponding to f would
be $f^{\#} = \lambda z.d$, while that of g, $g^{\#}$, would be $\lambda z.z$. Now, if we
applied $f^{\#}$ to n, we would obtain d, but n in the case of $g^{\#}$. The same
happens when tuples are involved. The abstract value of

    h v1 v2 v3 = let u = merge # v3 in (v1, v2, u)

is (d, d, n). But when we apply it, if any of the arguments has n as abstract value,
the result will be (n, n, n). If the abstract value were
$h^{\#} = \lambda z_1.\lambda z_2.\lambda z_3.(z_1, z_2, n)$
then we would not lose so much information; for example, $h^{\#}\ d\ n\ d = (d, n, n)$.

So the solution seems to be to interpret the functions and processes as
abstract functions, and this is what we do in what follows.
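The loss of precision discussed above can be made concrete with a small sketch. The function `apply_flat` is an assumption: it models the first analysis only as far as the text describes it, namely that a function is summarised by a single basic value and that an application joins this value with the argument's value. All Python names are hypothetical.

```python
D, N = "d", "n"

def lub(*values):
    """Least upper bound in the chain d <= n."""
    return N if N in values else D

def apply_flat(fun_value, arg_value):
    """Application when a function is summarised by one basic value
    (assumed model of the first analysis)."""
    return lub(fun_value, arg_value)

# First analysis: f v = 5 and g v = v are both simply 'deterministic'.
f_flat = D
g_flat = D

# Refined view: functions are interpreted as abstract functions.
f_abs = lambda z: D                                     # f# = \z.d
g_abs = lambda z: z                                     # g# = \z.z
h_abs = lambda z1: lambda z2: lambda z3: (z1, z2, N)    # h#

flat_results = (apply_flat(f_flat, N), apply_flat(g_flat, N))
abs_results = (f_abs(N), g_abs(N))
h_result = h_abs(D)(N)(D)
```

Here `flat_results` is `("n", "n")`, whereas `abs_results` is `("d", "n")`: only the functional interpretation notices that f ignores its argument. Likewise `h_result` is `("d", "n", "n")` rather than `(n, n, n)`.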

The Second Abstract Interpretation. We will denote everything related
to this analysis with the subscript 2 to distinguish it from the previous one.
Figure 7 shows the abstract domains. There are two differences. The
abstract domain corresponding to a function/process is the domain of
continuous functions between the abstract domains of the argument and the result.
The tuple type is now interpreted as the cartesian product of the corresponding
abstract domains, so nested tuples are allowed. This is important as now
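\[
\begin{array}{l}
Basic = \{d, n\} \quad \text{where } d \sqsubseteq n \\
D_{2K} = D_{2\,T\,t_1 \ldots t_m} = D_{2\beta} = Basic \\
D_{2(t_1,\ldots,t_m)} = D_{2t_1} \times \cdots \times D_{2t_m} \\
D_{2\,t_1 \to t_2} = D_{2\,Process\ t_1\ t_2} = [D_{2t_1} \to D_{2t_2}] \\
D_{2\,\forall\beta.t} = D_{2t}
\end{array}
\]

Fig. 7. Abstract domains for the refined analysis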



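\[
\begin{array}{l}
[\![v]\!]_2\ \rho_2 = \rho_2(v) \\
[\![k]\!]_2\ \rho_2 = d \\
[\![(x_1,\ldots,x_m)]\!]_2\ \rho_2 = ([\![x_1]\!]_2\ \rho_2, \ldots, [\![x_m]\!]_2\ \rho_2) \\
[\![C\ x_1 \ldots x_m]\!]_2\ \rho_2 = \bigsqcup_i \alpha_{t_i}([\![x_i]\!]_2\ \rho_2) \quad \text{where } x_i :: t_i \\
[\![e\ x]\!]_2\ \rho_2 = ([\![e]\!]_2\ \rho_2)\ ([\![x]\!]_2\ \rho_2) \\
[\![op\ x_1 \ldots x_m]\!]_2\ \rho_2 = (\gamma_{t_{op}}(d))\ ([\![x_1]\!]_2\ \rho_2) \ldots ([\![x_m]\!]_2\ \rho_2) \quad \text{where } op :: t_{op} \\
[\![p\ \#\ x]\!]_2\ \rho_2 = ([\![p]\!]_2\ \rho_2)\ ([\![x]\!]_2\ \rho_2) \\
[\![\lambda v.e]\!]_2\ \rho_2 = \lambda z \in D_{2t_v}.\ [\![e]\!]_2\ \rho_2[v \mapsto z] \quad \text{where } v :: t_v \\
[\![process\ v \to e]\!]_2\ \rho_2 = \lambda z \in D_{2t_v}.\ [\![e]\!]_2\ \rho_2[v \mapsto z] \quad \text{where } v :: t_v \\
[\![merge]\!]_2\ \rho_2 = \lambda z \in Basic.\ n \\
[\![let\ v = e\ in\ e']\!]_2\ \rho_2 = [\![e']\!]_2\ \rho_2[v \mapsto [\![e]\!]_2\ \rho_2] \\
[\![let\ rec\ \{v_i = e_i\}\ in\ e']\!]_2\ \rho_2 = [\![e']\!]_2\ (fix\ (\lambda \rho_2'.\ \rho_2[v_i \mapsto [\![e_i]\!]_2\ \rho_2'])) \\
[\![case\ e\ of\ (v_1,\ldots,v_m) \to e']\!]_2\ \rho_2 = [\![e']\!]_2\ \rho_2[v_i \mapsto \pi_i([\![e]\!]_2\ \rho_2)] \\
[\![case\ e\ of\ C_i\ v_{ij} \to e_i]\!]_2\ \rho_2 =
  \begin{cases}
    \gamma_t(n) & \text{if } [\![e]\!]_2\ \rho_2 = n \\
    \bigsqcup_i [\![e_i]\!]_2\ \rho_{2i} & \text{otherwise}
  \end{cases} \\
\qquad \text{where } \rho_{2i} = \rho_2[v_{ij} \mapsto \gamma_{t_{ij}}(d)],\ v_{ij} :: t_{ij},\ e_i :: t \\
[\![case\ e\ of\ k_i \to e_i]\!]_2\ \rho_2 =
  \begin{cases}
    \gamma_t(n) & \text{if } [\![e]\!]_2\ \rho_2 = n \\
    \bigsqcup_i [\![e_i]\!]_2\ \rho_2 & \text{otherwise}
  \end{cases} \\
\qquad \text{where } e_i :: t
\end{array}
\]

Fig. 8. The refined analysis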



functions/processes are interpreted as abstract functions. If a process has several
output channels and any of them returns a function/process, we would like to
maintain the information provided, so we do not apply the flattening. Figure 8
shows the new abstract interpretation.

Now the interpretation of a tuple is the tuple of abstract values of the
components. The interpretation of a function is an abstract function that takes an
abstract argument¹ and returns the abstract value of the body. So application is
interpreted as function application. In the recursive let the fixpoint can be
computed by using Kleene’s ascending chain, starting with an initial environment
where all the variables have $\bot_{2t}$ as abstract value.



¹ In the examples, we will not explicitly write the domain of the argument when it is
clear from the type of the function.






Ricardo Pena and Clara Segura 



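\[
\begin{array}{l}
\alpha_t : D_{2t} \to Basic \qquad\qquad \gamma_t : Basic \to D_{2t} \\[4pt]
\alpha_K = \alpha_{T\,t_1 \ldots t_m} = id_{Basic} \qquad\qquad \gamma_K = \gamma_{T\,t_1 \ldots t_m} = id_{Basic} \\[4pt]
\alpha_{(t_1,\ldots,t_m)}(e_1,\ldots,e_m) = \bigsqcup_i \alpha_{t_i}(e_i) \qquad\qquad
\gamma_{(t_1,\ldots,t_m)}(b) = (\gamma_{t_1}(b), \ldots, \gamma_{t_m}(b)) \\[4pt]
\alpha_{Process\ t_1\ t_2}(f) = \alpha_{t_1 \to t_2}(f) = \alpha_{t_2}(f(\gamma_{t_1}(d))) \\[4pt]
\gamma_{Process\ t_1\ t_2}(b) = \gamma_{t_1 \to t_2}(b) =
  \begin{cases}
    \lambda z \in D_{2t_1}.\ \gamma_{t_2}(n) & \text{if } b = n \\
    \lambda z \in D_{2t_1}.\ \gamma_{t_2}(\alpha_{t_1}(z)) & \text{if } b = d
  \end{cases}
\end{array}
\]

Fig. 9. Abstraction and concretisation functions definition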



In this new analysis we need two functions conceptually similar to the
flattening operator and to the adaptation of an abstract value to a type. These
functions are $\alpha_t$, called the abstraction function, and $\gamma_t$, called the
concretisation function, both defined in Figure 9. They are used in constructor
applications and in case expressions, respectively.

Given a type t, the abstraction function takes an abstract value in $D_{2t}$ and
flattens it to a value in Basic. This is necessary in constructor applications as a
single basic abstract value represents the whole structure.

Given a type t, the function $\gamma_t$ unflattens a basic abstract value and produces
an abstract value in $D_{2t}$. This function is used in an algebraic case expression.
The discriminant has a basic abstract value, as we have flattened all the values
of the components. We have to recover the values of the components in order to
analyse the right hand sides. But we have lost the information, so the only thing
we can do is to give a safe approximation of those values. This is what $\gamma_t$ does:
given a basic value, it gives a safe approximation to any abstract value that the
component could have had, considering how the flattening has been done.

The functions are mutually recursive. The idea of the abstraction function
is to flatten the tuples and apply the functions to the unflattening of d for the
argument’s type. The abstraction function loses information. As an example, if
$t = Int \to Int$, $\alpha_t(\lambda z.z) = \alpha_t(\lambda z.d) = d$. In Figure 10 we show
the abstraction function for the type $(Int \to Int) \to Int \to Int$.

The idea of the concretisation function is to obtain the best safe
approximation to determinism and non-determinism. It tries to recover the information
that the abstraction function lost. The function type needs explanation; the rest
are immediate. As we have said before, a function is deterministic if
it produces deterministic results from deterministic arguments. If the argument
is non-deterministic, the safest result we can produce is a non-deterministic one: it
is like an ‘identity’ function. So, the unflattening of d for a function type is a
function that takes an argument, flattens it to see whether it is deterministic or
not, and again applies the concretisation function with the type of the result. As
an example, if $t = Int \to Int$, $\gamma_t(d) = \lambda z.z$. In Figure 10 we show the
concretisation function for the type $(Int \to Int) \to Int \to Int$. The unflattening of n
for a function type is the function that returns a non-deterministic result
independently of the argument (it is the top of the abstract domain). For example,
if $t = Int \to Int$, $\gamma_t(n) = \lambda z.n$.
Fig. 10. Abstraction and concretisation functions for $t = (Int \to Int) \to Int \to Int$



We have proven that these functions are monotone and continuous and that
they form a Galois insertion [NNH99], i.e. $\alpha_t \cdot \gamma_t = id_{Basic}$ and
$\gamma_t \cdot \alpha_t \sqsupseteq id_{D_{2t}}$. This means that given a basic abstract
value b and a type t, $\gamma_t(b)$ gives the best safe approximation to those abstract
values that abstract to b. So, the abstract values below $\gamma_t(d)$ are deterministic.
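The mutual recursion between the two functions can be sketched in a few lines. This is an illustration only: the tagged-tuple encoding of types and the names `alpha`, `gamma`, `fun` and `tup` are assumptions, abstract functions are represented as Python callables, and only basic, tuple and function types are covered.

```python
D, N = "d", "n"

def lub(*values):
    """Least upper bound in the chain d <= n."""
    return N if N in values else D

# Hypothetical encoding of types.
INT = ("basic",)
def fun(t1, t2):
    return ("fun", t1, t2)
def tup(*ts):
    return ("tuple", ts)

def alpha(t, value):
    """Abstraction: flatten an abstract value of type t to a Basic value."""
    kind = t[0]
    if kind == "basic":
        return value
    if kind == "tuple":
        return lub(*(alpha(ti, vi) for ti, vi in zip(t[1], value)))
    if kind == "fun":
        _, t1, t2 = t
        # Probe the function with the unflattening of d at the argument type.
        return alpha(t2, value(gamma(t1, D)))
    raise ValueError("unknown type")

def gamma(t, b):
    """Concretisation: unflatten a Basic value b at type t."""
    kind = t[0]
    if kind == "basic":
        return b
    if kind == "tuple":
        return tuple(gamma(ti, b) for ti in t[1])
    if kind == "fun":
        _, t1, t2 = t
        if b == N:
            return lambda z: gamma(t2, N)            # top of the domain
        return lambda z: gamma(t2, alpha(t1, z))     # 'identity-like'
    raise ValueError("unknown type")

t = fun(INT, INT)
lost = (alpha(t, lambda z: z), alpha(t, lambda z: D))   # both flatten to d
unflat_d = gamma(t, D)                                  # behaves like \z.z
unflat_n = gamma(t, N)                                  # behaves like \z.n
```

With this encoding, `alpha(t, gamma(t, b)) == b` holds for both basic values at the types tried here, which is the first half of the Galois insertion property; the second half is not checked by the sketch.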

In the primitive operator application we have considered, as we did before,
that the primitive operators are deterministic, choosing as abstract value the best
safe approximation corresponding to the type of the operator, $\gamma_{t_{op}}(d)$. Another,
more accurate, option would have been to include in an initial environment the
abstract values of all the primitive operators.

Using this analysis in the replicated workers example, we obtain more accurate
information than in the first analysis. We have assumed that g, h and r are
deterministic functions. But now we have to provide abstract functions as their
abstract values. We assume that their abstract values are
$g^{\#} = h^{\#} = \lambda t.\lambda s.t \sqcup s$ and $r^{\#} = \lambda t.t$. These are
$\gamma_{Int \to Int \to Int}(d)$ and $\gamma_{Int \to Int}(d)$, that is, the ‘biggest’
deterministic functions of the corresponding types. Then, the abstract value of
manager is $manager^{\#} = \lambda t.(\pi_1(t) \sqcup \pi_2(t), \pi_1(t) \sqcup \pi_2(t), \pi_1(t))$.
This means that ys has as abstract value $ys^{\#} = (n, n, l)$, where l is the abstract
value of the argument ts. So the abstract value of rw is $rw^{\#} = \lambda w.\lambda l.l$.
This result tells us that the abstract value of the worker process is ignored and
that if ts is deterministic (l = d), then the result of the function is deterministic
as well.
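The result above can be reproduced with a small Kleene iteration. The sketch rests on several assumptions that are not in the source: task lists and worker outputs are treated as Basic values, the list constructor joins its elements, and y1, y2 and y3 are taken to select the first, second and third component of ys (the right-hand sides of those case expressions are cut off in the figure). The Python names are hypothetical.

```python
D, N = "d", "n"

def lub(*values):
    """Least upper bound in the chain d <= n."""
    return N if N in values else D

def manager_abs(t):
    """manager# = \\t.(pi1 t U pi2 t, pi1 t U pi2 t, pi1 t), as in the text."""
    t1, t2 = t
    return (lub(t1, t2), lub(t1, t2), t1)

def merge_abs(z):
    """merge# = \\z.n"""
    return N

def rw_abs(worker):
    """Abstract value of rw: takes the worker's abstract function, then
    the abstract value l of the task list ts."""
    def body(l):
        # Bottom environment for the recursive let.
        env = {"t": (D, D), "ys": (D, D, D), "y1": D, "y2": D, "y3": D,
               "o1": D, "o2": D, "os": D, "is": D}
        while True:
            new = {
                "t": (l, env["is"]),
                "ys": manager_abs(env["t"]),
                "y1": env["ys"][0],      # assumed selection
                "y2": env["ys"][1],      # assumed selection
                "y3": env["ys"][2],      # assumed selection
                "o1": worker(env["y1"]),
                "o2": worker(env["y2"]),
                "os": lub(env["o1"], env["o2"]),
                "is": merge_abs(env["os"]),
            }
            if new == env:               # fixpoint reached
                return env["y3"], env["ys"]
            env = new
    return body

result, ys = rw_abs(lambda z: z)(D)
```

For a deterministic task list the iteration stabilises with `ys == ("n", "n", "d")` and `result == "d"`, and changing the worker's abstract function does not change the result, in line with $rw^{\#} = \lambda w.\lambda l.l$.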
6 Related Work

The first analysis presented in this paper has been expressed first as a type
annotation system and afterwards as an abstract interpretation that is easily
extensible to a more powerful analysis. In recent years type-based analyses have
been widely used for several reasons, such as their better efficiency and their
adequacy when the information being looked for is preserved across transformations.
For example, in [TWM95] a type-based analysis is developed to detect values that
are accessed at most once. In [WJ99] type polymorphism and user-defined data
types are added. The language being analysed is a second-order polymorphic
λ-calculus extended with some Core constructions. The analysis annotates the
types with usage information.

In [BFGJ00] C. Baker-Finch, K. Glynn and S. Peyton Jones present their
constructed product result (CPR) analysis. The analysis aims to determine
which functions can return multiple results in registers, that is, which functions
return an explicitly-constructed tuple. It is an abstract interpretation based
analysis where the abstract domain corresponding to a function type $t_1 \to t_2$ is not
the corresponding functional domain, but is instead isomorphic to the abstract
domain of the result’s type $t_2$. Product types are interpreted as cartesian products
of a basic abstract domain, so nested tuples are not allowed. Our first analysis,
expressed as an abstract interpretation, follows the same ideas but for different
reasons, which have already been explained in the paper.

The second analysis is a typical abstract interpretation in the style of
[BHA86], where functions are interpreted as abstract functions. There, a
strictness analysis is presented where the basic abstract domain is also a two-point
domain ($\bot \sqsubseteq \top$). However, the analyses are rather different. As an example,
let $f :: (Int \to Int) \to Int$ be a function whose abstract interpretations in the
strictness analysis and in the non-determinism analysis are respectively $f^{s}$ and
$f^{n}$. To find out whether such a function is strict in its argument we apply $f^{s}$ to
$\bot_{Int \to Int}$, that is, to $\lambda z.\bot$. If the result is $\bot$, then it is strict in its
argument; otherwise it may be non-strict. On the other hand, if we want to know
whether it is deterministic or not, we apply $f^{n}$ to $\gamma_{Int \to Int}(d)$, that is, to
$\lambda z.z$: if the result is less than or equal to $\gamma_{Int}(d)$ (that is, it is equal to d)
then it is deterministic; otherwise it may be non-deterministic. For example,
$\lambda g.g\ (head\ (merge\ \#\ [[0],[1]]))$ is strict in its argument but it may be
non-deterministic, i.e. $f^{s}(\lambda z.\bot) = \bot$ but $f^{n}(\lambda z.z) = n$. Also, the
abstract interpretation of primitive operators, constructors and case expressions
is different in each analysis.
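The two tests can be contrasted on the example function with a tiny sketch. The abstract values chosen for the argument of g are assumptions: in the strictness reading the merged list element is taken to be defined (top), and in the non-determinism reading it is taken to be n because it comes from a merge. The Python names are hypothetical.

```python
BOT, TOP = "bot", "top"     # two-point strictness domain
D, N = "d", "n"             # two-point non-determinism domain

# f = \g. g (head (merge # [[0],[1]]))
f_s = lambda g: g(TOP)      # strictness reading: argument of g assumed defined
f_n = lambda g: g(N)        # non-determinism reading: argument comes from merge

# Strictness test: apply f_s to the bottom function \z.bot.
is_strict = f_s(lambda z: BOT) == BOT

# Determinism test: apply f_n to the unflattening of d, i.e. \z.z.
is_deterministic = f_n(lambda z: z) == D
```

The sketch yields `is_strict == True` and `is_deterministic == False`: the same function passes the strictness test while the non-determinism test reports that it may be non-deterministic.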

7 Conclusions and Future Work 

We have seen that non-determinism affects the definiteness of variables in the 
programs written in Eden: If a non-deterministic expression is evaluated in dif- 
ferent processes, the variable it is bound to will denote possibly different values. 
It would be desirable to warn the programmer about this situation, or to force 
the evaluation of such an expression so that all the occurrences of the variable 
have the same value. Additionally there exist sequential transformations that 
are incorrect when non-determinism is involved. Such transformations should be
applied only to those parts of the program that are sure to be deterministic. 

In this paper several analyses of different efficiency and power have been pre- 
sented. They detect when an expression is sure to be deterministic, and when it 
may be non-deterministic. The first one, both expressed as a type based analysis 
and as an abstract interpretation based one, is efficient (linear) but less accu- 
rate. The second one, an abstract interpretation based analysis, is more powerful 
but less efficient (exponential). One example has been given to compare their 
accuracy. The details regarding polymorphism are not explained in the refined 
abstract interpretation, but the idea is the same as in the first analysis: To rep- 
resent a polymorphic type by its smallest instance and afterwards to adapt the 
abstract value to the particular instance. This can be achieved by using functions 
similar to the abstraction and concretisation functions. 

Correctness of the analyses has not been proved, as there is not yet a formal
semantics for Eden. However, a simplified version where some details are
abstracted could be used to prove the correctness. Another interesting question
is the relation between the two analyses presented. Intuitively, the first analysis
is a safe approximation to the second one. It can be proved that it is in fact an
approximation to a widening [CC92] of the second analysis. Both the introduction
of polymorphism and the relation between the analyses will be presented in
a forthcoming paper.

We have already said that the first analysis has linear complexity, while the
second one has exponential complexity, as functions are involved. However, the
second analysis is more powerful than the first one. Following the ideas in [PP93], an
intermediate analysis could be developed so that it is more powerful than the
first one but less expensive than the second one. The idea is to use probing
to obtain a signature for the function. Such a signature is easily comparable and
represents a widening of the function. This speeds up the fixpoint calculation,
as the chain of widened approximations is shorter. The first analysis is in fact
a particular case of probing, where all the arguments are set to ‘d’. The idea is
to probe also the combinations of arguments where ‘n’ occupies each position.
For example, in a function with three integer arguments, the additional probings
would be (n, d, d), (d, n, d) and (d, d, n).
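The probing idea can be sketched as follows. The names `probes` and `signature`, the uncurried representation of abstract functions, and the example function `k_abs` are all assumptions made for illustration; the intermediate analysis itself is only outlined in the text.

```python
D, N = "d", "n"

def lub(*values):
    """Least upper bound in the chain d <= n."""
    return N if N in values else D

def probes(arity):
    """The all-d probe, then one probe per position holding n."""
    yield (D,) * arity
    for i in range(arity):
        yield tuple(N if j == i else D for j in range(arity))

def signature(abs_fun, arity):
    """Results of the abstract function on every probe; two signatures
    can be compared componentwise instead of comparing whole functions."""
    return tuple(abs_fun(*p) for p in probes(arity))

# Hypothetical abstract function whose result ignores its second argument.
k_abs = lambda a, b, c: lub(a, c)
sig = signature(k_abs, 3)
```

For three arguments `probes(3)` yields (d, d, d), (n, d, d), (d, n, d) and (d, d, n), and `sig` is `("d", "n", "d", "n")`: the third entry shows that a non-deterministic second argument does not affect the result, which the all-d probe alone cannot reveal.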

Another alternative to improve efficiency in the second analysis could be to
extend the type-based analysis in the style of [GS00] so that it mimicked the
powerful abstract interpretation at a lower cost.
Abstract. An effective execution model is a vital component of any
general-purpose implicitly parallel programming system. We introduce 
SLAM (Spreading Load with Active Messages), an execution model 
which overcomes many of the problems with previous approaches. SLAM 
is efficient enough to operate at low granularity without hardware sup- 
port, and has other necessary properties. Compiling for SLAM presents 
an unusual set of problems, and we describe how this is done from UFO- 
Lite, a simplified version of the United Functions and Objects program- 
ming language. Linear speedups are obtained for a program with irreg- 
ular, fine-grain, parallelism on stock hardware. 



1 Introduction 

Despite many years of research, the space between instruction-level parallelism 
and fully distributed computation remains sparsely occupied. Compilers can par- 
allelise suitable loops automatically, and highly skilled programmers can exploit 
process-level parallelism, but these approaches only work in restricted applica- 
tion domains. The main problem is the difficulty of programming parallel ma- 
chines for complex tasks, especially ones whose dynamic behaviour is irregular 
and unpredictable. 

The situation would change dramatically if an ordinary application program- 
mer could write, in a suitable high-level language, a program with parallelism 
in a very abstract sense, and have the system automatically produce an efficient 
parallel program for the target machine. This goal of implicit parallelism has 
proved very elusive. It is helpful to start with a programming language which 
does not over-specify sequencing. Both functional and parallel-object-oriented 
languages are good candidates, and our work is based on the United Functions 
and Objects (UFO) language [1,2], which combines these two paradigms. The
techniques described in this paper could, in principle, be applied to more conven- 
tional languages, although this would present additional engineering problems, 
some of which are indicated below. 

* The work described in this paper was funded by EPSRC grant GR/M10861 
The gap between the programming language and the machine is usually
considered too large to bridge directly. Various intermediate forms have been 
proposed, variously called “computational models”, “abstract machines”, and 
“bridging models”. Some intermediate forms are close to the semantics of the 
programming language, and need further mapping to reflect the properties of an 
actual parallel machine. Others have a straightforward implementation on a real 
machine; we will refer to these as execution models (EMs). 

Our experience has shown that getting the EM right is the most important 
prerequisite for practical implicit parallelism. Whether a higher-level intermedi- 
ate form is also necessary is a secondary issue. In fact, in our recent work we have 
abandoned our high-level intermediate form, Uflow [3], as the extra complexity
it introduced proved largely unnecessary. The current compiler operates on a 
standard AST data structure instead. 

The rest of the paper is structured as follows. In the next section we review 
the properties which a successful EM requires, and briefly highlight some of the 
problems which need to be overcome. We then introduce the SLAM (Spreading 
Load with Active Messages) model, and show how it overcomes many of these 
problems. We then discuss the way in which SLAM code is compiled from the 
UFO-Lite programming language, and present some encouraging performance 
results. 



2 Review of Execution Models 

2.1 Ground Rules 

Firstly, we assume that our target applications are general-purpose, with unpre- 
dictable flow of control, and unpredictable amounts of parallelism at irregular, 
potentially fine, granularity. The benchmark used to obtain the results in this 
paper is a purely functional program, although the techniques used should also 
be applicable to parallel object-oriented programs. 

Secondly, we assume that the target hardware is a conventional parallel pro- 
cessor. We do not assume shared (or virtual-shared) memory, but we do as- 
sume high bandwidth, low latency interconnection between the processors. In 
other words, we are interested in exploiting parallelism “within one box”, not 
distributed computing. The results presented below are for shared-memory ma- 
chines, although SLAM is designed to be equally suitable for distributed-memory 
architectures. 

There has, of course, been considerable work on alternative parallel archi- 
tectures such as dataflow; see for example the survey in [5]. However, the cost 
of building novel machines is very high compared to machines built from com- 
modity processors. Few commercially available multiprocessors have arisen from 
this work. Our execution model requires minimal hardware support. It is also 
designed to work in a practical environment where exclusive access to processors 
is not guaranteed. 
2.2 Properties Required of an EM

In order to work properly, an EM needs to have a number of critical, interrelated 
properties. 

Ability to exploit parallelism revealed at runtime. We wish to deal with 
applications which cannot be statically partitioned into parallel processes. 
The compiler can detect potential parallelism, but the actual parallelism 
available is known only at runtime. 

Low overhead, even at fine granularity. Without special hardware, inherently
fine-grain models such as dataflow or packet-based graph reduction are
expensive. Models which rely on conventional “lightweight” threads can be 
expensive if granularity is low because of the costs of creating and switching 
between such threads. 

Effective load balancing. Under high parallelism the load balancing system 
must avoid unnecessary idle time by spreading work quickly. Under low par- 
allelism it should avoid trying to use more processors than necessary. In 
both cases it should avoid sending excessive numbers of messages between 
processors. Coarse-grain, thread-based models tend to have problems in the 
former case, and fine-grain models in the latter. 

Locality. The model should preserve, as far as possible, the locality present 
in serial code. As well as preserving good cache hit-rates etc., the model 
should try to ensure that related data is held in the same memory as often 
as possible so that the number of remote data accesses (RDAs) is reasonable. 
Fine-grain models often have poor locality properties, and if fine-grain tasks 
are processed in a FIFO order, poor locality and excessive resource usage 
result [5,6].

Ability to hide remote data access latency. We do not want to be limited 
to machines with shared memory, or where remote data access latency can be 
neglected. Remote data access is a problem in thread-based models, because 
doing a full thread switch to cover an RDA is expensive. 

Ability to combine load balancing and data-following. It should be pos- 
sible to take the computation to the data where appropriate. For instance, 
in an object-oriented language, it should be possible to execute a method on 
an object in the processor where the object is. This is difficult in coarse-grain 
models because many method calls on different objects are contained within 
the same thread. 

An important alternative to the SLAM model is a coarse-grain stack-based 
evaluation model. Such models are relatively easy to compile for, and there is 
essentially no overhead if there is no parallelism - normal sequential execution 
occurs. Lazy task creation (LTC) [7] is an essential feature of such models. In
LTC, at each point where a task can be spawned, if the machine is busy at the 
time, a packet representing a potential task is created. This can be called upon 
if the activity level subsequently drops. LTC reduces idle time and improves 
granularity, because on average larger tasks are selected (by taking them from 
as far down the stack as possible). 
A number of LTC-style models have been investigated at Manchester, e.g.
[8,9], but the results were never totally convincing. The dynamics of such systems
are complex, the overheads are significant when granularity is low, and it is 
difficult to take the computation to the data where this is desirable. Nevertheless, 
an LTC-based approach is probably the only viable alternative to SLAM, at least 
on conventional hardware. 



3 SLAM 

3.1 Overview 

SLAM (Spreading Load with Active Messages) requires no special hardware 
support, and should be efficient on a range of parallel machines. It is based on 
active messages [10], which basically consist of a code pointer and some data.
The distinctive feature of active messages is not what they are, but the way 
they are used. The requirement on the hardware is that when an active message 
arrives at a processor, it should be possible to execute it directly, in user mode, 
by simply jumping to the code with no operating system intervention. If this 
criterion is met (trivial for shared memory; [10] discusses distributed-memory
implementations), construction and execution of active messages is very cheap. 
The basic properties of SLAM are: 

- Everything (real computation, load balancing, housekeeping and, if
necessary, remote data access) is done with active messages.

- Each physical processor holds a stack of active messages (the SLAM stack)
and repeatedly executes the active message on the top of this stack. There is
exactly one thread running in each processor; we refer to the activation stack
of this thread as the “C stack”, since the runtime system is implemented in
C.

The use of a stack avoids heap management overhead and controls paral- 
lelism. Note that we do not create a separate thread with its own stack when we 
execute an active message; we simply execute code using the existing C stack. 
This means that the “lifecycle cost” of an AM is very small, and so we can 
exploit parallelism at a much lower granularity than a thread-based model. The 
downside is that an active message cannot be suspended (since there is no stack
to maintain its state on) - once it starts it must run to completion. Situations 
which would be dealt with by suspension in a thread-based model can be handled 
by creating further AMs to execute the code required after the suspension, but 
this is messy to compile for, and we try to avoid such situations where possible. 

3.2 Details of the Model 

An active message is implemented as a packet consisting of a code pointer, a 
return address, arguments, and a header which includes the size of the packet 
and a count of the number of results it is waiting for; if this count is zero the 
packet is active. This “header” is actually at the bottom end of the packet, for
reasons explained shortly. 

A packet can be in one of three states: executing, active (ready to execute) 
and waiting (for results from other packets). Note that there is no suspended 
state; once a packet starts executing it runs to completion. 

Each processor executes the following driver loop: 

while (not terminated)
{ if (message buffer not empty) put message on TOS;
  if (packet at TOS is active) execute it;
  else try to steal work from other processors
}



At the bottom of each stack there is a packet which contains code to look 
for work if the processor is idle. This means that there is always a packet on the 
stack, and the driver loop merely has to check whether it is active. 
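The packet format and driver loop can be illustrated with a single-processor simulation. This is a sketch under assumptions, not the authors' C runtime: the class and function names are hypothetical, the idle packet simply stops the simulation instead of stealing work, and the work-stealing branch is left as a stub.

```python
class Packet:
    def __init__(self, code, args=(), waiting=0, ret=None):
        self.code = code          # "code pointer": a Python callable
        self.args = list(args)
        self.waiting = waiting    # results still awaited; 0 means active
        self.ret = ret            # "return address": (packet, argument slot)

def deliver(packet, value):
    """Store a result at the packet's return address and update the
    receiver's count of outstanding results."""
    target, slot = packet.ret
    target.args[slot] = value
    target.waiting -= 1

class Processor:
    def __init__(self):
        self.inbox = []           # message buffer
        self.terminated = False
        self.idle_packet = Packet(self.look_for_work)
        self.stack = [self.idle_packet]   # bottom packet, never popped

    def look_for_work(self, packet):
        # A real system would try to steal work here; this sketch stops.
        self.terminated = True

    def run(self):
        while not self.terminated:
            if self.inbox:
                self.stack.append(self.inbox.pop(0))
            top = self.stack[-1]
            if top.waiting == 0:
                if top is not self.idle_packet:
                    self.stack.pop()
                top.code(top)     # runs to completion, never suspended
            else:
                raise RuntimeError("stub: would steal work or busy-wait")

results = []
def square(p):
    deliver(p, p.args[0] ** 2)
def add(p):
    results.append(p.args[0] + p.args[1])

proc = Processor()
total = Packet(add, args=[None, None], waiting=2)
proc.stack.append(total)
proc.stack.append(Packet(square, args=[3], ret=(total, 0)))
proc.inbox.append(Packet(square, args=[4], ret=(total, 1)))
proc.run()
```

The `add` packet sits on the stack in the waiting state until both `square` packets have delivered their results; it then becomes active and runs, leaving `results == [25]`, after which only the idle packet remains.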

There are several possible variations. For instance, if the packet at TOS is 
not active (because it is waiting for results from other processors), this imple- 
mentation tries to steal work from elsewhere, and if it fails it busy-waits. An 
alternative is to search down the SLAM stack for an active packet, haul it up to 
the top, and execute it. Early experiments using hand-generated code indicated 
that the extra complexity and resultant loss of locality was not worthwhile. 

It is even possible to eliminate the explicit driver loop altogether, and have 
the execution of each packet end with a jump to the next. In fact, the original 
SLAM model (as conceived by the third author) worked this way. These issues 
are discussed in [4], which also describes the interface to which generated SLAM
code conforms, and other details. 

When work is exported from one processor to another, the active packet near- 
est the bottom of the SLAM stack is taken. Hence execution within a processor is 
LIFO, but distribution of parallelism across processors is FIFO. This improves 
both locality and the probability that an exported packet will represent a sig- 
nificant amount of work (because it is nearer to the root of the execution tree). 
This local-LIFO-global-FIFO execution order is extremely important 0. The 
header is at the bottom of the packet for the benefit of this mechanism, which 
otherwise could not interpret the contents of the stack. Each exported packet is 
replaced with a no-op, so that the SLAM stack can safely unwind back past it. 
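The export rule can be sketched with a list used as the SLAM stack, index 0 being the bottom. The dictionary encoding of packets and the names `make_packet` and `export_packet` are hypothetical, and skipping the bottom idle packet during export is an assumption of this sketch.

```python
def noop():
    """Placeholder left where an exported packet used to be."""
    return None

def make_packet(name, active=True, code=None, idle=False):
    return {"name": name, "active": active, "code": code, "idle": idle}

def export_packet(stack):
    """Take the active packet nearest the bottom (global FIFO) and leave
    a no-op in its place so the stack can unwind past it."""
    for i, pkt in enumerate(stack):          # index 0 is the bottom
        if pkt["active"] and not pkt["idle"] and pkt["code"] is not noop:
            stack[i] = make_packet("no-op", code=noop)
            return pkt
    return None

stack = [
    make_packet("idle", idle=True),
    make_packet("A"),                 # oldest real work, nearest the root
    make_packet("B", active=False),   # waiting for results
    make_packet("C"),                 # newest work
]
exported = export_packet(stack)       # what another processor receives
next_local = stack[-1]                # what this processor runs next (LIFO)
```

After the call, `exported` is packet A, the stack reads idle, no-op, B, C from bottom to top, and the local processor continues with C.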



3.3 Load Balancing 

Load balancing is implemented via active messages, but the details of the al- 
gorithm used are independent of SLAM. The current implementation uses a 
receiver-initiated work-stealing scheme. Such schemes tend to behave poorly un- 
der low parallelism, because they generate traffic to look for work which is not 
there, and handling these messages distracts the processors which do have work. 
This can be a significant problem even for programs with good overall paral- 
lelism; they often have sequential tails (e.g. simply producing the results) which 
cause the load balancing to become unstable enough to seriously damage the
overall performance. 

We compensate for this by adopting a conservative notion of when a processor 
is “busy” enough to be asked for work, namely that it must execute a prescribed 
number of consecutive tasks without an idle cycle or a load balancing message 
intervening. (This is cheaper to track than the natural measure of activity for 
SLAM, i.e. the number of active packets on the stack.) Resetting the count when 
a load balancing message is processed prevents the busy processor from being 
flooded with requests. Effectively we have a control system with feedback and 
we are adding hysteresis to stabilise it. 
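The conservative notion of “busy” amounts to a counter that is reset by idle cycles and by load balancing messages. The class below is a hypothetical sketch: its names are invented, and the threshold value used in the example is an assumption, since the text does not give the prescribed number.

```python
class BusyTracker:
    def __init__(self, threshold):
        self.threshold = threshold    # prescribed number of consecutive tasks
        self.consecutive = 0

    def task_executed(self):
        self.consecutive += 1

    def idle_cycle(self):
        self.consecutive = 0

    def load_balancing_message(self):
        # Resetting here keeps a busy processor from being flooded.
        self.consecutive = 0

    @property
    def busy(self):
        """True only after an uninterrupted run of `threshold` tasks."""
        return self.consecutive >= self.threshold

tracker = BusyTracker(threshold=3)    # threshold chosen only for illustration
history = []
for event in ["task", "task", "task", "lb", "task", "task", "task"]:
    if event == "task":
        tracker.task_executed()
    else:
        tracker.load_balancing_message()
    history.append(tracker.busy)
```

In this run `history` is `[False, False, True, False, False, False, True]`: the processor is eligible to be asked for work after three uninterrupted tasks, loses that status as soon as a request is processed, and has to earn it again.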

An alternative is to use an adaptive symmetric load balancing scheme [11],
which is specifically designed to work well in both high-parallelism and low- 
parallelism situations. It appears that the extra complexity of such a scheme 
is unnecessary, since the results below suggest that the current scheme is ade- 
quate. All the evidence available suggests that, for purely functional code, the 
load balancing strategy is application independent. However this needs to be 
re-examined when stateful objects are considered. 

3.4 Other Issues 

Garbage collection is not currently implemented, but is obviously essential to a 
“real” implementation. We intend to use one of the schemes designed elsewhere, 
since we are not specifically interested in parallel garbage collection. Active mes- 
sages can of course be used to implement the necessary communication between 
processors. 

There are a number of possible ways of dealing with remote data access in 
SLAM. For instance, in an object-oriented setting, a packet represents a method 
call on an object and can be sent to the processor where that object is. Con- 
versely, active messages provide an efficient way of fetching copies of remote data. 
There are a number of issues to resolve, and the tradeoffs between approaches 
are architecture-specific. This is an area of future research, and is not discussed 
further here. 

The results presented below are for shared-memory machines. Explicit mes- 
sage passing between processors is only used for load balancing; RDA and return- 
ing of results is done using the shared memory. Implementing a message-passing 
system on top of shared memory hardware may seem a little odd. One advan- 
tage is that the system can cope with very fine-grain parallelism, as exhibited 
by the test program described below. Also, when mutable objects are involved, 
taking the computation to the data, even on shared memory, avoids both consis- 
tency and performance problems. Pragmatically, using shared memory provides 
us with a test-bed for most aspects of SLAM, without the extra complexity of 
dealing with RDA. 

The primary consideration in the shared-memory case is to make best use 
of the cache hardware. In particular, unnecessary sharing (real or false) leads to 
disastrous loss of performance. The processors each allocate from separate areas 
of store, and the runtime system is carefully tuned. For instance, data structures 
which are always written to and read by the same processor are carefully aligned 
with the (L2) cache lines. 

4 Compiling for SLAM 

4.1 Overview 

Compiling for SLAM presents no fundamental theoretical problems, but is very 
complex in practice. “Real world” requirements such as separate compilation 
and dynamic linking would present significant additional problems, and we have 
simplified the task in two major ways: 

1. The source language, UFO-Lite, is a hybrid functional/object-oriented lan- 
guage with semantics which make detecting parallelism easy. UFO-Lite is a 
greatly simplified version of UFO, with many features omitted. Like UFO, 
UFO-Lite is strongly typed and the functional part is strict but, for instance, 
UFO-Lite does not have higher-order functions - they are simulated using 
objects when necessary. It is intended as a vehicle for exploring implicit 
parallelism, not as a programming language for real applications. 

2. Compilation takes place under a strong closed-world assumption that the 
complete source code for a program is available to the compiler throughout. 
(Cf. Java, where the ability to load classes at runtime presents fearsome 
challenges for optimisers etc.) 

The compilation process will be illustrated using a naive quicksort function. 
This has little useful parallelism in practice (except perhaps for extremely large 
data sets) and the results presented later are for a more substantial program 
with more usable parallelism. 

UFO-Lite source is as follows: 

quicksort(input: List[String]): List[String] is
  if input.length < 2 then input
  else {
    pivot = input.head;

    // filter is a sequential function which does the comparisons
    (left, right) =
      filter(input.tail, pivot, []: List[String], []: List[String]);

    return
      quicksort(left) ++ [pivot] ++ quicksort(right)
  }
  fi



Clearly UFO-Lite is a rather primitive language, and features such as
collection comprehensions and type inference would reduce the function to a
couple of lines. However, this is just a matter of syntactic sugar (apart from
type inference, which we consider inappropriate for a hybrid language, for
reasons discussed elsewhere [1]), and having everything explicit makes it
easier to correlate the source with the output (and intermediate versions)
produced by the compiler.

The above code uses only the built-in classes List, String and Tuple2. A 
simple example of a user-defined class (used in the test program described below) 
is: 

class Domino(front, back: String)
  toString: String is front ++ "|" ++ back

  // The constructor is implicit and there is no ‘new’ keyword.
  reverse: Domino is Domino(back, front)

  matches(other: Domino): Bool is this.back.equals(other.front)
end // Domino

UFO-Lite also has (single) inheritance and dynamic binding. All examples 
discussed in this paper are purely functional, but UFO-Lite, like earlier versions 
of UFO, also has “stateful objects” which allow explicit state to be manipulated 
in a disciplined way. 

The compiler operates on a simple AST representation of the program. It 
consists of the following phases: 

Parsing. The AST is built by a conventional parser generated using the JavaCC 
tool. 

Semantic checking. The program is checked for type correctness and other 
semantic constraints, so subsequent phases can assume that the program is 
well-formed. 

Optimisation. Very little is currently done in this phase; minor
optimisations are performed on blocks of value definitions, because this
simplifies the SLAM partitioning process. A wide variety of optimisations were
implemented for a previous version of UFO [13] and could therefore be applied
here. However, optimisation is not really relevant to our current purposes.

Property determination. The main purpose of this phase is to gather
statistics on the estimated runtime costs of each part of the program, as
described below. Other properties can also be determined; for instance it is
useful to identify functions which do not create any objects, since this
simplifies tracking of roots for GC. The syntax tree is annotated with
properties as appropriate. (This phase and the optimisation phase could be
iterated, but while they are only done once each, the properties determined
need to be those of the optimised program.)

Call graph building. A separate data structure representing the call graph of
the program is built from the AST, and compressed so that it shows only
the basic recursive structure of the program.

Execution mode determination (EMD). The compressed call graph is used
to determine the SLAM execution mode (inline, active, or waiting) for each
function call. This information is then mapped back onto the main AST.

SLAM partitioning. The AST is transformed to include various SLAM-related
operations, according to execution mode information.

Code generation. The target code is ANSI C, including calls to SLAM library 
functions. Rather than output text directly, we generate an intermediate 
data structure called Abstract Imperative Code. This contains abstractions 
for the basic constructs of an imperative language, such as assignments, 
conditionals etc. and related utilities. The rest of the compiler knows almost 
nothing about the form of the output. This decoupling makes it easier to 
produce correct output, and also allows for the possibility of different output 
forms, such as Java or JVM code in the future. 

If ordinary sequential C code is required, the SLAM-related phases are sim- 
ply omitted. The phases relevant to SLAM are discussed in more detail in the 
following sections. 

4.2 Property Determination 

The primary purpose of this phase is to gather statistics about the expected 
execution times of various parts of the program. Our conjecture, supported by 
results so far, is that a crude static estimate of execution time, supported by a 
simplistic cost model, provides sufficient information for the following Execution 
Mode Determination phase to make sensible decisions about where we should 
and should not attempt to exploit parallelism. 

In principle, given any expression of the form $f(e_1, \ldots, e_n)$, each of the $e_i$ 
expressions can be evaluated in parallel as a separate packet. However, in many 
cases the expressions will be trivial, and parallel evaluation is only appropriate 
if at least two of them involve significant computation. It is therefore clear that 
any sort of plausible partitioning requires at least some idea of the sizes of 
computations. 

It is, of course, impossible to know how long a computation will take, except 
for trivial cases, because of conditionals. In UFO-Lite, we have both explicit 
conditionals and dynamic binding. 

However, in both cases we can obtain upper and lower bounds on the amount 
of computation involved (for recursive computation the upper bound will be 
infinite). The upper bound (U) is the maximum of the upper bounds for the 
different possibilities and the lower bound (L) is the minimum of the alternative 
lower bounds. For a dynamically bound function, the bounds are calculated over 
all the alternative implementations. (This is possible because of our closed-world 
assumption.) 

Since the cost of a call depends on the cost of the function called, in the 
presence of recursion the statistics gathering process has to be iterated to a 
fixedpoint. 
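
The bound calculation and its iteration to a fixed point can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the compiler's actual analysis: the tuple encoding of expressions, the unit costs, the body given to `filter`, the helper `is_short` and the choice to start every function at (0, infinity) are all hypothetical.

```python
import math

INF = math.inf


def bounds(expr, table):
    """Return (L, U) for an expression, given current per-function bounds."""
    kind = expr[0]
    if kind == "leaf":                 # ("leaf", cost): a trivial computation
        return (expr[1], expr[1])
    if kind == "call":                 # ("call", name): use the callee's bounds
        return table[expr[1]]
    parts = [bounds(e, table) for e in expr[1]]
    lows = [p[0] for p in parts]
    highs = [p[1] for p in parts]
    if kind == "seq":                  # all parts are executed: costs add up
        return (sum(lows), sum(highs))
    if kind == "alt":                  # one alternative is executed
        return (min(lows), max(highs))
    raise ValueError(kind)


def solve(program, max_rounds=100):
    """Recompute all bounds until nothing changes (or a round limit is hit)."""
    table = {name: (0, INF) for name in program}
    for _ in range(max_rounds):
        new = {name: bounds(body, table) for name, body in program.items()}
        if new == table:
            break
        table = new
    return table


program = {
    # Assumed shape: a base case, or some work plus a recursive call.
    "filter": ("alt", [("leaf", 1),
                       ("seq", [("leaf", 3), ("call", "filter")])]),
    "quicksort": ("alt", [
        ("leaf", 1),
        ("seq", [("leaf", 2), ("call", "filter"),
                 ("call", "quicksort"), ("call", "quicksort")]),
    ]),
    # A non-recursive helper, invented for contrast.
    "is_short": ("seq", [("leaf", 1), ("leaf", 1)]),
}
result = solve(program)
```

With these inputs the recursive functions end with a small lower bound and an infinite upper bound, while the non-recursive helper gets finite, equal bounds.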

Clearly this is extremely crude, but it does enable us to identify some impor- 
tant special cases, for instance trivial computations with small upper bounds, 
and “banker” computations with large lower bounds. For most interesting cases, 
the lower bound is low and the upper bound is very high (often infinite).
Better information could be obtained by using runtime monitoring, along the
lines suggested in [14]. However, it is interesting to see how well we can do
with purely static information. Since SLAM is inherently very efficient, very
crude information may be good enough.

We therefore simply guess the expected cost, E, of a computation as some 
function of L and U. The guess currently used is 

    if U < infinity then (L + U) / 2
    else L * infinityfactor

where infinityfactor is an arbitrarily chosen constant.
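
The guess translates directly into a small function. This is a sketch only; the value used for the constant is an assumption, since the text says only that it is arbitrarily chosen.

```python
import math

INFINITY_FACTOR = 10  # assumed value; the text does not state the one used


def expected_cost(lower, upper, infinity_factor=INFINITY_FACTOR):
    """Guess the expected cost E from the bounds L and U."""
    if upper < math.inf:
        return (lower + upper) / 2
    return lower * infinity_factor


finite_case = expected_cost(2, 6)            # midpoint of the two bounds
recursive_case = expected_cost(1, math.inf)  # lower bound scaled by the factor
```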

This enables us to distinguish trivial from non-trivial computations, but we 
also need to distinguish parallel ones from sequential ones. The sums above 
return approximately the same values for f(g(h(x))) and f(g(x), h(x)), and 
clearly the difference is important. 

In addition to the statistical information, it is also necessary to analyse the 
underlying structure of the program. 



4.3 Call Graph Building 

The call graph is a cyclic graph with the following types of nodes: 

Leaf nodes represent computations deemed trivial, such as constants or calls to 
primitives. 

Seq nodes represent computations which have to be done in sequence, e.g. data- 
dependent local definitions. 

Par nodes represent computations which can potentially be executed in parallel. 
For instance, a binary operator is mapped onto a Par node. 

Alt nodes represent alternative computations, due either to explicit conditionals 
or to dynamic binding. 

Call nodes represent calls, which in turn refer to the function being called. 

The call graph is built from the AST and then recursively compressed until 
all leaf nodes are eliminated. What is left is a “bare-bones” view of the recur- 
sive structure of the program (plus any non-recursive components deemed large 
enough to be interesting). For instance, the call graph for the quicksort function 
reduces to 

Seq(Call:filter, Par(Call:quicksort, Call:quicksort)).

The uncompressed graph has an Alt node representing the conditional, but this 
disappears because the condition and one alternative are leaves. 
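
The compression step can be illustrated with a small sketch. It is written in Python under stated assumptions and is not the compiler's own algorithm: the tuple encoding of nodes, the rule that a node left with a single child is replaced by that child, and the shape chosen for the uncompressed quicksort graph are all hypothetical. It also ignores the non-recursive components that are kept when deemed large enough.

```python
def compress(node):
    """Drop Leaf nodes and collapse nodes left with fewer than two children."""
    kind = node[0]
    if kind in ("Leaf", "Call"):
        return node
    kids = [compress(k) for k in node[1]]
    kids = [k for k in kids if k[0] != "Leaf"]
    if not kids:
        return ("Leaf",)       # nothing interesting remained
    if len(kids) == 1:
        return kids[0]         # a wrapper around one child adds nothing
    return (kind, kids)


def show(node):
    """Render a graph in the notation used in the text."""
    kind = node[0]
    if kind == "Leaf":
        return "Leaf"
    if kind == "Call":
        return "Call:" + node[1]
    return kind + "(" + ", ".join(show(k) for k in node[1]) + ")"


# Assumed uncompressed graph for quicksort: the conditional is an Alt whose
# condition and first alternative are leaves; each ++ is a Par node.
quicksort_graph = ("Alt", [
    ("Leaf",),                                   # input.length < 2
    ("Leaf",),                                   # then input
    ("Seq", [
        ("Leaf",),                               # pivot = input.head
        ("Call", "filter"),
        ("Par", [
            ("Par", [("Call", "quicksort"), ("Leaf",)]),   # ... ++ [pivot]
            ("Call", "quicksort"),
        ]),
    ]),
])
compressed = show(compress(quicksort_graph))
```

For this input `compressed` is the string `Seq(Call:filter, Par(Call:quicksort, Call:quicksort))`, the form quoted above.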

The call graph is not theoretically necessary, as it contains no “new” infor- 
mation. However, building it has several practical advantages: 

- It makes the EMD phase much simpler, since this operates on a graph with
  the four relevant node types, rather than the 30 or so in the main AST.

- It makes it easier to see what is going on, both in the current research and
  potentially as the basis for a visualisation tool for end-users.

- It potentially provides the basis for other forms of analysis of the program.
  For instance, in a binary recursion such as the quicksort function, the base
  case must occur half the time. The call graph provides a good data
  structure for spotting such patterns, which can in turn be used to make a
  more “educated” guess about expected costs.

4.4 Execution Mode Determination 

The execution mode is one of: 

Inline. The node will be translated to ordinary sequential C. 

Active. The node will be translated to an active SLAM packet. 

Waiting (k). The node will (in principle) be translated to a packet waiting for 
the results of k other packets. 

The active parts of a graph are dependent only on the inline nodes while the 
waiting part is in general dependent on both the active nodes and further inline 
nodes. 

In the current scheme, active nodes are always call nodes in the graph. These 
may or may not correspond directly to calls in the source program, depending 
on the optimisations performed earlier. 

EMD is first performed on the compressed call graph, annotating it as re- 
quired. For instance, in the quicksort example, the two recursive calls to quicksort 
are marked as active, and the Par node which combines them as waiting for two 
results. 

The execution modes of the key call nodes in the compressed call graph 
are then mapped back onto the main AST, and a sweep across this fills in the 
execution modes of the rest. 

The combination of statistical and structural information guarantees that no 
attempt is made to parallelise expressions with no real parallelism. 
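
One way to picture the decision on the compressed graph is the sketch below. It is a hypothetical rule written in Python, not the authors' EMD algorithm: a Par node becomes waiting(k) only when at least two of its children are calls whose expected cost reaches a threshold, and those k calls become active. The node encoding, cost values and thresholds are assumptions.

```python
def assign_modes(node, cost, threshold):
    """Return (node description, mode) pairs in traversal order."""
    out = []

    def visit(n):
        kind = n[0]
        if kind == "Call":
            out.append(("Call:" + n[1], "inline"))
            return
        kids = n[1]
        if kind == "Par":
            # A child is worth a packet only if it is a sufficiently costly call.
            heavy = [k[0] == "Call" and cost[k[1]] >= threshold for k in kids]
            if sum(heavy) >= 2:
                out.append(("Par", "waiting(%d)" % sum(heavy)))
                for k, is_heavy in zip(kids, heavy):
                    if is_heavy:
                        out.append(("Call:" + k[1], "active"))
                    else:
                        visit(k)
                return
        out.append((kind, "inline"))
        for k in kids:
            visit(k)

    visit(node)
    return out


graph = ("Seq", [("Call", "filter"),
                 ("Par", [("Call", "quicksort"), ("Call", "quicksort")])])
cost = {"filter": 10, "quicksort": 10}      # assumed expected costs
modes = assign_modes(graph, cost, threshold=5)
all_inline = assign_modes(graph, cost, threshold=50)
```

With the lower threshold the Par node is marked as waiting for two results and both quicksort calls are active, matching the example in the text; with the higher one everything stays inline.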

4.5 SLAM Partitioning 

This is the key stage, and the most complex, so only a brief outline is given. 
The code resulting from partitioning the quicksort function has the following 
shape: 

{
    Unpack stack pointer (SP), return address (RA) and argument (input)
    if length of input < 2 {
        return value of input to RA
    }
    else {
        Sequential code to get pivot and call filter;                (1)

        Construct waiting packet to append the results together
        and return the answer;                                       (2)

        Construct active packets for the two recursive quicksorts,
        returning to slots in the waiting packet;                    (3)
    }

    return new value of SP;
}

^ Footnote to the last point of Section 4.3: although this requires a less
compressed graph than that shown, since we would need to preserve the Alt node
representing the conditional.



The corresponding C code appears in appendix A. 

The code segments (1) - (3) correspond to the inline, waiting, and active parts 
of the graph. Note that the waiting packet appears before the active packets, 
because they must return values to it; this part of the code is effectively reversed 
from its normal ordering. 

A naive partitioning would create two waiting packets, one for each append,
for the expression

    quicksort(left) ++ [pivot] ++ quicksort(right)

The overhead of this is unacceptable, and instead all the
waiting code is gathered together in a single auxiliary function, so there is only 
ever one waiting packet corresponding to a set of active packets. 
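
The interplay of one waiting packet and two active packets can be imitated on a single processor with an ordinary stack. The sketch below is a Python simulation under stated assumptions, not SLAM itself: the `Packet` class, the slot layout and the helper names are hypothetical, and packet export, headers and no-op replacement are not modelled.

```python
class Packet:
    def __init__(self, func, args, waiting_for, return_to):
        self.func = func                # code to run once the packet is active
        self.args = list(args)          # argument slots; None marks an empty slot
        self.waiting_for = waiting_for  # number of results still outstanding
        self.return_to = return_to      # (packet, slot), or None for the caller


stack = []
final = []


def deliver(value, return_to):
    """Return a value to a slot of a waiting packet (or to the caller)."""
    if return_to is None:
        final.append(value)
        return
    packet, slot = return_to
    packet.args[slot] = value
    packet.waiting_for -= 1


def append3(packet):
    # The single auxiliary function holding all the waiting code.
    left, middle, right = packet.args
    deliver(left + middle + right, packet.return_to)


def quicksort(packet):
    (xs,) = packet.args
    if len(xs) < 2:
        deliver(xs, packet.return_to)
        return
    pivot, rest = xs[0], xs[1:]                      # (1) inline part
    left = [x for x in rest if x < pivot]
    right = [x for x in rest if x >= pivot]
    waiting = Packet(append3, [None, [pivot], None], 2, packet.return_to)
    stack.append(waiting)                            # (2) one waiting packet
    stack.append(Packet(quicksort, [left], 0, (waiting, 0)))   # (3) active
    stack.append(Packet(quicksort, [right], 0, (waiting, 2)))  # (3) active


def run(xs):
    stack.append(Packet(quicksort, [xs], 0, None))
    while stack:
        packet = stack.pop()             # local LIFO order
        assert packet.waiting_for == 0   # its results have already arrived
        packet.func(packet)
    return final.pop()


sorted_words = run(["pear", "apple", "fig", "kiwi"])
```

Because the waiting packet is pushed before the active packets that feed it, it is reached only after both results have been delivered, which is why it is constructed first.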

Conditional computations typically create packets in some branches and not 
others. The partitioning process preserves the branching structure of the original 
code. Branches which do not create packets return their results directly. Ones 
which do are recursively partitioned, so that each such branch contains some 
active packets and a waiting packet. 

5 Results 

5.1 An Example Application 

Finding suitable benchmarks is problematic. The traditional toy programs give 
the usual good speedups but lack credibility. Translating “real” programs into 
UFO-Lite would be a substantial effort and the results would be hard to inter- 
pret. Existing benchmark sets are intended for other purposes (e.g. assessing the 
performance of sequential lazy FL implementations) or require features such as 
arrays which UFO-Lite does not have. 

We chose to invent our own benchmark, one which is relatively simple but 
which provides significant challenges to the compiler and the runtime system, 
and has yielded a lot of useful information. 

We define a domino to be an object containing two strings. Two dominos 
match if the “front” string of one equals the “back” string of the other. Dominos 
can be reversed, except for “doubles” whose front and back are the same. The 
program, given a collection of dominos, returns all chains of matching dominos 
which use the complete collection. For instance, for the set: 

one|two    three|two    two|two

there are two solutions:

one|two  two|two  two|three
three|two  two|two  two|one

The data used to obtain the results below follows the same pattern up to 
fifteen. 

The program operates by exhaustive, brute-force search, implemented by 
recursions along lists of dominos, and is certainly not the most efficient solution 
to the problem. However, it is large enough (around 50 lines of UFO-Lite) to 
show some interesting behaviour, but small enough for that behaviour to be 
understandable. The program presents a significant challenge to any scheme of 
implicit parallelism because: 

1. As well as being dynamic and irregular, the parallelism in this application is 
at a very fine grain; most branches of the search tree terminate immediately 
because most pairs of dominos don’t match. In fact, it is not a priori obvious 
that the program has much usable parallelism at all. 

2. It is very store-intensive; lists of dominos are created and discarded at a very 
high rate. 

The system must therefore be efficient at fine granularity and at the same 
time preserve locality, so that most data is read by the processor which wrote it. 
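
For readers who want to experiment with the problem, the search can be sketched in a few lines. This is a sequential Python illustration of the problem statement, not a translation of the UFO-Lite benchmark; the tuple representation of dominos and the function names are assumptions.

```python
def reverse(domino):
    front, back = domino
    return (back, front)


def chains(dominos):
    """All chains that use every domino once, each matching its successor."""
    results = []

    def extend(chain, remaining):
        if not remaining:
            results.append(chain)
            return
        for i, d in enumerate(remaining):
            rest = remaining[:i] + remaining[i + 1:]
            # Doubles are not reversed; other dominos may go either way round.
            orientations = [d] if d[0] == d[1] else [d, reverse(d)]
            for o in orientations:
                # The back of the previous domino must equal this front.
                if not chain or chain[-1][1] == o[0]:
                    extend(chain + [o], rest)

    extend([], list(dominos))
    return results


example = [("one", "two"), ("three", "two"), ("two", "two")]
solutions = [" ".join(f + "|" + b for f, b in chain)
             for chain in chains(example)]
```

For the three-domino set given earlier this yields exactly the two chains listed there. Most calls to `extend` stop immediately because the next domino does not match, which is the very fine grain referred to above.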

The results presented below are for two slightly different versions of the
program. The first (the S-version) represents the strings as standard UFO
Strings. This means that comparisons between them cost almost nothing (they are
calls to the C strcmp() function). This is arguably unrepresentative of more
realistic applications, where we would expect a higher proportion of “real
work” to store allocation (e.g. matching molecular structures or evaluating
game positions rather than comparing strings). The second (the L-version)
represents the strings as lists of characters, and so does rather more work in
the comparisons, although it is still very fine-grain.

Results were obtained on a 4 processor SGI Challenge shared memory
multiprocessor and on a 16 processor SGI Origin 2000 virtual-shared-memory
machine. All code was compiled using the SGI C compiler at its highest
optimisation level (-O3). Times are user times in seconds as reported by
/bin/time. Stopwatch times and internal timings from the C library clock()
function gave very similar results; the user times are slightly less affected
by other activity in the machine. All numbers reported are the mean values
derived from at least 3 runs.



5.2 Shared Memory Results 

Table 1. Performance results on a 4 processor SGI Challenge

         S version (37.4)              L version (63.9)
PEs    Time   S (rel)   S (true)     Time   S (rel)   S (true)
  1    52.9     1.00      0.71       84.7     1.00      0.75
  2    26.4     2.00      1.41       41.6     2.04      1.54
  4    13.2     4.01      2.83       21.0     4.03      3.04

Table 1 shows the results obtained on the Challenge. The execution times for
the ordinary sequential versions are given in parentheses in the top row. The
table gives times for the SLAM versions on different numbers of PEs, and the
relative speedup (compared to SLAM on one PE) and true speedup (compared to
the ordinary sequential version) in each case. The sequential versions are
compiled with garbage collection and all associated overheads disabled, since
the SLAM versions do not have GC. Clearly the program does have a useful
amount of parallelism, something which is far from immediately obvious. The
relative speedups are linear, or even slightly superlinear. On this machine
the performance of the caches dominates performance. Having N times as much
cache available by using N processors negates much of the cost of parallel
execution. This of course excludes the overhead which is paid for SLAM even on
one PE. This overhead can probably be reduced by further tuning, but in any
case we are observing real speedups of around 3 on 4 processors.

^ Footnote to item 1 in Section 5.1: for data where most pairs of dominos do
match, the number of solutions is very large, and the program becomes I/O
bound.
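
The two speedup measures can be recomputed from the reported times. The short Python sketch below does so for the Table 1 figures; the function name and the dictionary layout are illustrative choices.

```python
def speedups(sequential_time, slam_times):
    """Map each PE count to (relative speedup, true speedup).

    Relative speedup compares against SLAM on one PE; true speedup
    compares against the ordinary sequential version.
    """
    one_pe = slam_times[1]
    return {pes: (one_pe / t, sequential_time / t)
            for pes, t in slam_times.items()}


# Times in seconds, copied from Table 1.
s_version = speedups(37.4, {1: 52.9, 2: 26.4, 4: 13.2})
l_version = speedups(63.9, {1: 84.7, 2: 41.6, 4: 21.0})

for pes in (1, 2, 4):
    rel, true = s_version[pes]
    print("S version, %d PEs: relative %.2f, true %.2f" % (pes, rel, true))
```

The recomputed values agree with the table to within about 0.01; small differences are to be expected because the printed times are themselves rounded.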



Table 2. Statistics for a 4 PE execution

PE     Cycles    Packets   Messages      Store
 0     984418     984213        108    4929705
 1     814913     814286        133    5059410
 2    1007754    1006863        231    6193026
 3    1046719    1045827        160    6364242




A sample 4 PE run (of the S-version) gives the information in Table 2. The 
number of SLAM packets executed is very close to the total number of cycles of 
the driver loop, indicating very little idle time. By contrast the number of mes- 
sages between processors is very small, showing that the load balancing is very 
stable. The final column shows the amount of store (in 32 bit words) allocated 
by each PE, illustrating how store-intensive the program is. In particular, note 
that the total memory use considerably exceeds the cache size. 
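
The observations above can be checked directly from the Table 2 figures. The Python sketch below derives idle cycles, the fraction of driver-loop cycles that executed a packet, and the total store allocated; the variable names are illustrative, and the conversion to bytes assumes 4 bytes per 32 bit word.

```python
# (cycles, packets, messages, store in 32 bit words) per PE, from Table 2.
table2 = {
    0: (984418, 984213, 108, 4929705),
    1: (814913, 814286, 133, 5059410),
    2: (1007754, 1006863, 231, 6193026),
    3: (1046719, 1045827, 160, 6364242),
}

idle_cycles = {pe: cycles - packets
               for pe, (cycles, packets, _, _) in table2.items()}
utilisation = {pe: packets / cycles
               for pe, (cycles, packets, _, _) in table2.items()}
total_messages = sum(row[2] for row in table2.values())
total_words = sum(row[3] for row in table2.values())
total_mib = total_words * 4 / 2**20

print(idle_cycles)
print("messages: %d, store: %.1f MiB" % (total_messages, total_mib))
```

Every PE spends more than 99.9% of its driver-loop cycles executing packets, only a few hundred messages are exchanged in total, and the four PEs together allocate roughly 86 MiB.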

There is some variation between PEs, indicating that there was some other 
activity due to other users on the machine at the time. However, the system 
copes with this smoothly. 

5.3 Virtual Shared Memory Results 

Table 3. Results on a 16 processor SGI Origin 2000

         S version (9.25)              L version (16.4)
PEs    Time   S (rel)   S (true)     Time   S (rel)   S (true)
  1   12.55     1.0       0.75       20.1     1.0       0.8
  2    6.3      2.0       1.5        10.0     2.0       1.6
  4    3.2      3.9       2.9         5.1     3.9       3.2
  8    1.8      7.0       5.1         2.8     7.2       5.9
 10    1.75     7.2       5.3         2.4     8.4       6.8

The results in Table 3 were obtained on a moderately-loaded 16 processor SGI
Origin 2000 virtual shared memory multiprocessor. The code was identical to
that for the Challenge, except for the values of two constants used to
fine-tune cache behaviour. Other activity on the machine limited the available
resources to 10 PEs. The individual processors are much faster than on the
Challenge, as the sequential times show. The results are very similar to the
shared-memory case up to 4 PEs, and further speedups are obtained up to 10.
The drop-off in speedup for 8 and 10 PEs may be due to limited parallelism in
the application, overly conservative load balancing or other implementation
artifacts, competition from other users, or some combination. Further
investigation is needed to determine how much each of these factors
contributes.

5.4 Lessons Learned 

For a long period, speedups such as those shown above were unobtainable, even
though the basic SLAM mechanisms were working correctly. We learned two
important lessons from this:

1. Slight instabilities in the load balancing system are rapidly magnified as
the number of processors increases, even from 2 to 4. Decent speedups were
only obtained once the very conservative scheme described in section 3.3 was
adopted.

2. On these machines, every unnecessary instance of sharing (real or false) 
must be ruthlessly eliminated. There were numerous instances of this: the 
final bottleneck was a single shared counter, left over from the sequential 
List library, where it was used for monitoring purposes. One of the potential 
benefits of implicit parallelism is that the number of programmers who need 
to worry about such things is much reduced. 

5.5 Improvements and Extensions 

Obviously it is necessary to investigate larger and more realistic programs, in 
particular ones with dynamic binding, and also ones using stateful objects, where 
it will be necessary to ensure that the computation goes to the data rather than 
vice versa. We also hope to obtain results for larger numbers of processors, and 
in the long run different types of machines. 

The overhead of SLAM execution can be reduced further. A realistic target
may be “breakeven” (real speedup on 2 processors) at a granularity of 100
machine instructions, and it appears that the compiler can certainly guarantee
this granularity.

Many improvements can be made to the compiler, for instance inlining to
increase granularity. Garbage collection needs to be incorporated, and may be
beneficial in improving locality.

6 Conclusions 

We have shown that SLAM meets many of the requirements for an effective 
execution model for implicit parallelism. The UFO-Lite to SLAM compiler is 
able to expose parallelism in irregular recursive programs, and avoids generating 
trivial tasks by using simple static estimates of the costs of partial computations. 
Although the cost modelling is very imprecise, it appears to be good enough, 
because the underlying SLAM model can cope with relatively small granularities. 
We have demonstrated linear relative speedups and significant real speedups, on 
a program with complex dynamic behaviour. It must be admitted, however, that 
the results quoted are for just this single program, and that a wider spread of 
applications is needed to be fully convincing. However we have demonstrated, in 
principle at least, that implicit parallelism can be made to work on conventional 
parallel machines. 



A SLAM C Code for the Quicksort Function 



// SLAM functions take and return stack pointer values.
SPTR _SLAM_quicksort(SPTR _sptr) {
    SPTR _block3;
    {
        SPTR _sp;
        RA _ra;
        Oid _input;
        _sp = _sptr;
        // Unpack the stack pointer, return address and argument from the
        // SLAM stack.
        _unpack_2_1(_sp, _ra, _input);
        // If length(input) < 2 ...
        if ((_ListP_length(_input) < 2))
        {
            // Return the value of input to the return address
            _retval(_input, _ra);
        }
        else {
            // Inline call to filter etc. _dollar_lhs_dummy3 is an internally
            // generated identifier for the (left, right) tuple.
            Oid _pivot;
            Oid _dollar_lhs_dummy3;
            Oid _temp__10;
            SPTR _temp_waiting_11;
            _pivot = _ListP_head(_input);
            _dollar_lhs_dummy3 = _List_String_filter(_ListP_tail(_input),
                                     _pivot, null_ListP, null_ListP);
            _temp__10 = _ListP_cons(_pivot, null_ListP);

            // Set up a waiting packet to call the auxiliary function below.
            // LOCAL means the packet will not be exported to another processor
            // once it becomes active (in order to maintain locality).
            // The EMPTY slots are where the results will go.
            _sp = _pack_5_1(_sp, _SLAM_aux_quicksort_4, 2, LOCAL, _ra,
                            _input, EMPTY, _temp__10, EMPTY);

            _temp_waiting_11 = _sp;

            // Active packets for the recursive calls to quicksort.
            // FREE means these can be exported to other processors.
            // _mk_RA constructs appropriate return addresses.
            // INDEX_Oid accesses components of objects (in this case of tuples)
            _sp = _pack_2_1(_sp, _SLAM_quicksort, 0, FREE,
                    _mk_RA(_temp_waiting_11, 2), INDEX_Oid(_dollar_lhs_dummy3, 2));
            _sp = _pack_2_1(_sp, _SLAM_quicksort, 0, FREE,
                    _mk_RA(_temp_waiting_11, 4), INDEX_Oid(_dollar_lhs_dummy3, 3));
        }
        _block3 = _sp;
    }
    // Always return the stack pointer. The C compiler tidies up the
    // extra identifiers etc.
    return _block3;
}

// Auxiliary function to append the results together.
SPTR _SLAM_aux_quicksort_4(SPTR _sptr) {
    SPTR _block8;
    {
        SPTR _sp;
        RA _ra;
        Oid _input;
        Oid _rval_quicksort_6;
        Oid _temp__10;
        Oid _rval_quicksort_7;
        _sp = _sptr;
        // Unpack the results..
        _unpack_5_1(_sp, _ra, _input, _rval_quicksort_6, _temp__10,
                    _rval_quicksort_7);
        _pivot = _ListP_head(_input);
        // And return the result of appending them.
        _retval(_ListP_append(_ListP_append(_rval_quicksort_6, _temp__10),
                              _rval_quicksort_7), _ra);

        _block8 = _sp;
    }
    return _block8;
}



Abstract. The Erlang Verification Tool is an interactive theorem prover 
tailored to verify properties of distributed systems implemented in Er- 
lang. It is being developed by the Swedish Institute of Computer Science 
in collaboration with Ericsson. 

In this paper we present an extension of this tool which allows one to reason 
about the Erlang code on an architectural level. We present a verification 
method for client-server systems designed using the generic server imple- 
mentation of the Open Telecom Platform. For this purpose, we specify 
a set of transition rules which characterize the abstract behaviour of 
the generic server functions. By this means we can reason in a parti- 
tioned way about any client-server application without having to con- 
sider the concrete implementation details of the generic part, which sim- 
plifies proofs dramatically. 

The generic server architecture is just an example, and the technique 
extends to many other generic components. Moreover, the idea of con- 
sidering standard components to reason on the architectural level of a 
concrete implementation can also be explored when using other verification
tools for Erlang or in the context of another language.



1 Introduction 

The high quality demands on software for telecommunication applications may 
partly be ensured by the use of formal methods in the design and development. 
Owing to the high degree of concurrency in those applications, testing is often not 
sufficient to guarantee correctness to a satisfactory degree. Verification, namely 
formally proving that a system has the desired properties, is therefore becoming 
a more and more widespread practice (see [CW96] for an overview). 

Although a complete formal specification of an application would probably 
be one of the best ways to ensure its correctness, in practice the descriptions are 
rather informal, written in natural language in combination with some fragments 
of, for example, the Standard Description Language SDL [SDL93]. Reasons for 
the absence of a complete formal specification can be found in the fact that the 
specification changes several times during development, triggered by experiments 
with a release or by changed requirements. It is felt too time consuming to modify 
both the formal specification and the code. Even the informal specification tends 

M. Mohnen and P. Koopman (Eds.): IFL 2000, LNCS 2011, pp. 37-|5^ 2001. 

(c) Springer-Verlag Berlin Heidelberg 2001

to run out of phase with the actual implementation and is often only updated 
after a release of the product. 

Towards the end of a project, it is the running code that represents the 
best ‘specification’ of the implementation. Questions about its correctness are 
therefore best formulated in terms of this code: ‘is there a possibility that this 
finite-state machine implementation deadlocks?’, ‘does this server implementa- 
tion correctly respond to all possible requests?’. 

In order to find answers to these questions, one might abstract from the code
(with the informal specification helping in this) and check the questions against
the model thus obtained. Realistically, verification is then used for finding
errors more than for proving correctness. If the model does not fulfill a certain
property, then this might indicate an error in the code. It is common practice to
analyze the given trace that leads to the detected error and to check whether
this is also a valid trace in the actual code. This need not be the case, since a
model (which is an abstraction) neglects some, potentially essential, details.

In general the constructed model depends on the property one wants to 
prove. Often one does not directly construct the final model, but many models 
are built, which are refinements of an initial rough model. The model is refined 
until either a detected error can be identified as a real error, or until one has 
enough confidence in the detailedness of the model to believe that the code is 
error-free. In this analysis method, finding a trace to an error can efficiently and 
automatically be performed by a model checker. The construction of a model 
and its refinements, including checking the trace in the real code, can often only 
be done by hand or with some minor computer assistance. 

Given that the code is the only available formal description of the software,
as an alternative to building the models and checking the traces one can use a
theorem prover to reason about the code directly. An interactive proof assistant
for the purpose of verifying properties of programs written in the functional
programming language Erlang [AVWW96] has been developed by the Swedish
Institute of Computer Science (SICS) in collaboration with Ericsson. This
Erlang Verification Tool (EVT; [ADFG98]) can be regarded as a tableau-based
prover with proof rules for first-order modal logic in Gentzen-style, extended
by rules that reflect the semantics of Erlang, rules for decomposing proofs about
compound systems to proofs about the components, and rules for induction and
co-induction [DFG98].¹ The disadvantage that proofs have to be provided by
hand should be weighed against the advantages of obtaining certainty that a
property holds for the code and of the possibility to reason about unbounded
data structures, unbounded message queues, and dynamic creation of processes.
Moreover, bugs are detected by the fact that proofs cannot be provided, and the
attempt to prove the property usually clearly indicates a trace in the code which
can be used as a counterexample.

In this paper we present an addition to this verification tool which allows us
to reason about the Erlang code on an architectural level. In this way we provide
a general abstraction, obtained automatically from the code, that is detailed
enough to prove many different properties of the program.

¹ EVT is available at ftp://ftp.sics.se/pub/fdt/evt/index.html.

Many applications that we consider consist of several servers that communicate
with their clients. These servers are all implemented in a predefined,
generic way. The generic implementation takes care of starting the server,
receiving synchronous and asynchronous messages, and providing debug and log
information for maintenance purposes. We added the specification of the
functions of this generic server as proof rules to the verification tool. More
precisely: based on the transition-system semantics of Erlang as presented in
[Fre], we provide rules which describe the possible transitions that any Erlang
process evaluating the respective function can take, restricted by the shape of
the environment if necessary. Thus, we abstract from the actual implementation of
the server and concentrate on its specific behaviour instead, so that we can argue
about any client-server application without having to consider the source code
of the generic part. In this way we support a relativized style of reasoning which
is based on the assumption that the concrete implementation of the generic
module follows its specification. This abstraction is property-independent and
can hence be used for all properties we are interested in, while it allows us to
skip many details, which leads to much smaller proofs.

The remainder of this paper is organized as follows. Sect. 2 describes the
class of systems addressed by our approach, namely, client-server systems
implemented in Erlang using the gen_server module. In Sect. 3 we give the abstract
representation of some of the gen_server functions (the complete specification
can be found in [AN00]). Their implementation in EVT is discussed in Sect. 4.
In Sect. 5 we address the correctness of our approach and conclude with some
comparative remarks.

2 Generic Client-Server Implementations 

Large software applications are built using a software architecture. Elements of 
such architectures are: databases, device drivers, finite-state machines, supervi- 
sors, monitors, servers, and many more. After putting the architecture together, 
the actual implementation of the components may start. Software engineering 
practice has taught that having all servers implemented in some general way is 
an advantage, both for development and for maintenance. Even better, when 
parts of the server software are already written and are used as the basis for 
all specific servers, it serves the correctness of the whole application, since the 
generic part is well developed and tested. Therefore, the Open Telecom
Platform (OTP; [OTP]), the set of libraries and design principles that comes along
with Erlang, supports a standard, generic implementation of a server by
providing the gen_server module. This module implements several interface
functions providing synchronous and asynchronous communication, debugging
support, error and timeout handling, and other administrative tasks. In order to
obtain the required specific server functionality the programmer provides an
instantiation for this generic server. This instantiation consists of a separate
module, the so-called callback module, which contains (callback) functions that
are invoked by the generic part of the server. Thanks to this software engineering
practice we are able to easily abstract from the actual server implementation in
the code.

The typical flow of control in a gen_server-based client-server application is
as follows. When a client process wants to synchronously communicate with the
server, it uses the standard gen_server:call function with a certain message as
an argument. The generic part sends the message to the server process and blocks
the client. In the server process, another function of the generic part receives the
message and forwards it to the application-specific part by calling a function in
the callback module. This callback function should return the response and the
new server state. The new state is stored in the server process, and the reply
is returned to the client by the generic part of the server, which completes
the synchronous event.

In greater detail, the following individual steps are taken.

– To start the server process the gen_server:start function is called. This
function creates a new process, the server process, in which a function is
started that implements the server. The first thing this function does is
to compute its initial state by calling the init function in the callback
module. After that, the process waits for a request from a client process.

– The client uses the gen_server:call function to send a synchronous request
to the server. The request is handled by the handle_call function in the
callback module while the client is suspended, waiting for the response.
The current state of the server is passed as an argument to the callback
function, which in turn returns both a reply message and a new state.
The reply message is sent to the suspended process.

– Alternatively, gen_server:cast can be used to send asynchronous requests
to the server. Here only the internal state of the server is changed according
to the result of the handle_cast function in the callback module.

– Both handle_call and handle_cast can return a value indicating that the
server should terminate. In this case, gen_server invokes the terminate
function in the callback module to clean up before the process terminates.

Clearly the flow of control as described above may look different when error
situations occur, such as a server that cannot be started or a call that cannot
be handled. In those cases some standard error handling is performed by the
generic server. In addition, several options can be provided to the standard calls
in order to have them behave slightly differently.
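The steps above can be illustrated with a minimal callback module. The following is only a sketch under assumptions: the module name counter and its API functions start/0, next/1, reset/1 and stop/1 are hypothetical and do not appear in the original text. The gen_server functions and callback signatures are those of the standard OTP library.

```erlang
-module(counter).
-behaviour(gen_server).

%% Client API (hypothetical names chosen for this sketch).
-export([start/0, next/1, reset/1, stop/1]).
%% Callback functions invoked by the generic part of the server.
-export([init/1, handle_call/3, handle_cast/2, terminate/2]).

%% Creates the server process; returns {ok, Pid} once init/1 has returned.
start() ->
    gen_server:start(?MODULE, 0, []).

%% Synchronous request: the client is suspended until the reply arrives.
next(Server) ->
    gen_server:call(Server, next).

%% Asynchronous request: only the server state changes, no reply is sent.
reset(Server) ->
    gen_server:cast(Server, reset).

%% Synchronous request that makes the server terminate.
stop(Server) ->
    gen_server:call(Server, stop).

%% Computes the initial state.
init(Start) ->
    {ok, Start}.

%% Returns both a reply and the new state, or asks for termination.
handle_call(next, _From, N) ->
    {reply, N, N + 1};
handle_call(stop, _From, N) ->
    {stop, normal, ok, N}.

%% Returns only the new state.
handle_cast(reset, _N) ->
    {noreply, 0}.

%% Clean-up hook, called before the server process terminates.
terminate(_Reason, _N) ->
    ok.
```

A client would evaluate {ok, S} = counter:start() and then call counter:next(S) repeatedly. Each call returns the current counter value, and counter:reset(S) sets the counter back to 0 without suspending the caller.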

The following example of a simple locker server implements a scheduler that
arbitrates the access to a single resource. It can be used by several clients at
a time, communicating synchronously by executing function calls of the form
gen_server:call(Server, request) and gen_server:call(Server, release) to
request a lock and release it thereafter, respectively.

The example is classical, and so are the properties of interest (formulated
for a server with arbitrarily many clients), such as: no deadlock, no starvation,
mutual exclusion.

For details on the syntax and semantics of Erlang see [AVWW96]. It is a
concurrent programming language with processes that execute functions. Erlang
is an eager, dynamically typed language with only a few data types. In this paper
we use atoms (constants) which are denoted by lowercase symbols, tuples, and
lists. Variables start with an uppercase character, except for the special variable
_ which matches any value without getting bound to it (i.e., it is always a free
variable).

The client function is trivially implemented in a module called client by a 
function with the same name that takes the process identifier of the server as a 
parameter (to establish communication). 

    -module(client).

    client(Server) ->
        gen_server:call(Server, request),
        access_the_resource(),
        gen_server:call(Server, release).

The state of the server consists of a list of pending clients. More exactly, the
client that currently has access to the resource is stored in the head, and all
waiting clients are kept in the tail of this list. We start the server by evaluating
gen_server:start([]) to initialize it with an empty list of pending processes.

    -module(locker).
    -behaviour(gen_server).

    init(Requests) ->
        {ok, Requests}.

    handle_call(request, From, Requests) ->
        case Requests of
            [] ->
                {reply, ok, [From]};
            _ ->
                {noreply, Requests ++ [From]}
        end;
    handle_call(release, From, [_ | Waiting]) ->
        case Waiting of
            [] ->
                {reply, done, Waiting};
            _ ->
                gen_server:reply(hd(Waiting), ok),
                {reply, done, Waiting}
        end;
    handle_call(stop, From, Requests) ->
        {stop, normal, ok, Requests}.

    terminate(Reason, Requests) ->
        ok.

The gen_server:call function in the client causes the handle_call function
in the callback module to be executed. Note that the return value (either ok or
done) is ignored by the client process, i.e., no check is performed to see if the value
really is the expected one. The return of a value is a synchronization mechanism
which in this case is independent of the actual value. In this particular example
we could have restricted ourselves to asynchronous communication for releasing
by using gen_server:cast instead of gen_server:call, but for readability we aim
to concentrate on only one communication primitive. Also note that the server
expects the clients to stick to the locking protocol. In other words, in this simple
version we left out any effort to program defensively. For example, misbehaving
clients can crash the locker by sending a release without a previous
request or by sending a message that is not recognized by the locker at all.

On the level of the actual execution, Erlang supports only one way of
communicating messages, which is asynchronous. However, the gen_server
implementation ensures a synchronized behaviour: the gen_server:start function
will not return before the init function has returned a state, and the
gen_server:call function only returns if the handle_call function returns a reply
to the caller. In the gen_server module this synchronous communication is
implemented by using the asynchronous primitives: a message is sent and a receive
statement directly follows this output operation, waiting for the response. The
main goal of the technique we present in this paper is to be able to abstract from
the implementation of the synchronous communication. The fact that the message
is read from the message queue in a certain way, that a timeout primitive is
supported, and similar details need not concern us.
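The idea of a send that is directly followed by a receive can be sketched as follows. This is not the code of the gen_server module: the module name sync_call, the message formats and the echo server are assumptions made for illustration, and the sketch omits the timeout and failure handling that the real library provides.

```erlang
-module(sync_call).
-export([call/2, reply/2, echo_server/0]).

%% Client side: asynchronous send, directly followed by a receive that
%% waits for the answer. The fresh reference lets the client recognize
%% the reply to this particular request in its mailbox.
call(Server, Request) ->
    Ref = make_ref(),
    Server ! {call, {self(), Ref}, Request},
    receive
        {Ref, Reply} -> Reply
    end.

%% Server side: the answer is sent back with the same reference.
reply({Client, Ref}, Reply) ->
    Client ! {Ref, Reply},
    ok.

%% A trivial server loop that answers every request immediately.
echo_server() ->
    receive
        {call, From, Request} ->
            reply(From, {echo, Request}),
            echo_server()
    end.
```

With S = spawn(sync_call, echo_server, []), evaluating sync_call:call(S, hello) blocks the caller until the server has answered and then returns {echo, hello}. The abstraction described in this paper hides exactly this kind of mailbox handling.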

In this abstract setting, the given server implementing the locker can be
described as a server that stores a list of clients which claim access to the
resource. The number of clients is arbitrary, and in properties or proofs we want
to make no assumptions about an upper bound. Access is granted to the first client
in this list; the other clients are suspended. Only after a release by the client
that currently accesses the resource does the client that is next in line get
access to it. The suspension of clients is implemented by not providing the reply
immediately (returning {noreply, ...}), but sending it later
(gen_server:reply(..., ok)).

3 The Verification Approach 

Even without our addition, the Erlang Verification Tool can be used to verify
the Erlang code of an application which makes use of the generic server
implementation.² When establishing such a proof, one has to follow the simulation
of the synchronous communication by the underlying asynchronous
implementation. By the nature of the Erlang semantics this means that one should
also prove some properties about the message queues of the client(s) and of the
server, which seems irrelevant given knowledge of how the server works. In
particular, when many clients are involved, a lot of nondeterminism can be
introduced in the proof by observationally equivalent traces. As such, the number
of proof goals may be much larger than seems strictly necessary.

² Although the EVT tool lacks support for modules at the moment, one can combine
the callback, the gen_server and, if present, some client module into one bigger
module with little effort.

Another disadvantage in such a proof is that one gets confronted with details 
such as debug features implemented in the gen_server module. Although the 
verification is performed in a context where debug facilities are assumed to be 
disabled, one still generates an extra proof goal for testing the debug flag. This 
test is not atomic, and since we work in a concurrent setting, even those few 
steps cause duplication of work, since another action may be chosen in another 
process at that same time. 

In our approach we simplify the verification task by ignoring the concrete 
implementation of the gen_server module. We specify its abstract behaviour by 
making its syntactic constructs recognizable as keywords by EVT, and by adding 
appropriate transition rules to the proof system. Since these transition rules are 
of a general nature, they can also be used for implementing our approach in 
other tools supporting Erlang. The actual implementation of the rules in EVT 
is described in Sect. 4.



3.1 Extending the Erlang Verification Tool 

An Erlang system is specified by a composition of processes, each represented as 
(e,pid,q), where e is the Erlang expression being evaluated, pid is the uniquely 
defined process identifier, and q is the mailbox queue in which incoming messages 
are stored. In order to have the tool recognize a generic server function call, we 
add these as special syntactic constructs to the set of expressions: 

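$$
\begin{array}{rcl}
e & ::= & \ldots \mid \mathtt{gen\_server{:}start}(e_1, e_2, e_3) \mid \mathtt{gen\_server{:}call}(e_1, e_2) \mid \\
  &     & \mathtt{gen\_server{:}reply}(e_1, e_2) \mid \mathtt{gen\_server{:}wait}(e) \mid \\
  &     & \mathtt{gen\_server{:}ready}(e) \mid \mathtt{gen\_server{:}busy}(e_1, e_2) \mid \\
  &     & \mathtt{gen\_server{:}down}(e_1, e_2, e_3) \mid \ldots
\end{array}
$$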

In this way, those function calls can be treated in a different way than the 
other function calls. The standard method is to search for a definition of the 
function, to substitute the arguments, and to continue with evaluating the body 
of the definition. The way the special function calls are treated is defined by 
an extension of the operational semantics which is defined by labeled transition 
rules (in the style of [Fre]).

A reduction context is an Erlang expression r[·] with a ‘hole’ · in it, which
identifies the position in r where the next evaluation step takes place. In this
way, the rules for the actual expression evaluation have to be given only for
exceptional cases, namely, when all parameters of an expression construct are
values, i.e., have been fully evaluated.

For example, process creation is formally described by the following rule³:

$$
(r[\mathtt{spawn}(f, [v_1, \ldots, v_n])],\ pid,\ q)
\longrightarrow
(r[pid'],\ pid,\ q) \parallel (f(v_1, \ldots, v_n),\ pid',\ \varepsilon)
\tag{1}
$$

Here, $f$ is a (function) atom, $v_1, \ldots, v_n$ are values, $q$ is an arbitrary
mailbox and $\varepsilon$ denotes the empty mailbox. Thus, a process evaluating a
spawn function call has a transition to a system of two processes ($\parallel$
denotes parallel composition) which have to evaluate the expressions $r[pid']$
($pid'$ is the return value of spawn) and $f(v_1, \ldots, v_n)$, respectively. For
the process identifiers $pid$ and $pid'$ we require $pid' \neq pid$.
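The effect described by this rule can be observed in a running Erlang system. The following sketch uses the hypothetical module name spawn_demo and the helper idle/0, neither of which is part of the original text. It only inspects one concrete run and is of course no substitute for the formal rule.

```erlang
-module(spawn_demo).
-export([demo/0, idle/0]).

%% Body of the spawned process: wait until told to stop, so that the
%% process is still alive when it is inspected.
idle() ->
    receive
        stop -> ok
    end.

demo() ->
    Parent = self(),
    %% spawn returns the pid of the newly created process ...
    Child = spawn(?MODULE, idle, []),
    %% ... which differs from the pid of the spawning process,
    true = Child =/= Parent,
    %% and the new process starts with an empty mailbox.
    {message_queue_len, 0} = erlang:process_info(Child, message_queue_len),
    Child ! stop,
    ok.
```

Evaluating spawn_demo:demo() returns ok if both checks succeed.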

Assuming now that the set of reduction contexts has been extended accord- 
ingly to cope with the new syntactic constructs, we can formalize the intuitive 
meaning of the gen_server functions as described in the previous section as 
follows. Here we describe the starting of a server process and the handling of a 
server call; the complete specification is given in [AN00].

Starting a server is similar to spawning a process, but the continuation of 
the process depends on the evaluation of the init callback function. This is the 
reason for adding the special gen_server:wait construct: 



$$
(r[\mathtt{gen\_server{:}start}(mod, arg, opt)],\ pid,\ q)
\longrightarrow
(r[\mathtt{gen\_server{:}wait}(spid)],\ pid,\ q) \parallel (\mathtt{init}(arg),\ spid,\ \varepsilon)
\tag{2}
$$



Here, spid denotes a fresh (server) pid, and the server is created with an empty 
mailbox. The term gen_server:wait(spid) should not be treated as a normal 
form, since then reductions in the context r[·] would be allowed, but rather as a
construct from which currently no transitions are possible (similar to a receive 
statement with an empty mailbox). 

According to the generic server description, the result of evaluating the init 
function should be a tuple with the initial server state as its second component. 
This state should be kept as part of the looping server (looping over: receiving 
a request, computing the answer and the next state, and responding). 



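$$
\begin{array}{l}
(r[\mathtt{gen\_server{:}wait}(spid)],\ pid,\ q) \parallel (\{\mathtt{ok}, state\},\ spid,\ sq) \\
\quad \longrightarrow\ (r[\{\mathtt{ok}, spid\}],\ pid,\ q) \parallel (\mathtt{gen\_server{:}ready}(state),\ spid,\ sq)
\end{array}
\tag{3}
$$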

Note that the identifier of the process that started the server is not known to 
the server in this specification. Since pids of newly created processes are unique, 
this causes no problem in our setting. The starting process ‘remembers’ which 
server it has started (by the obtained process identifier spid). 

A call by a client can be handled by the server if it is in an idle state, denoted
by the gen_server:ready construct. In this case, the server process invokes the
handle_call callback function, and the client process is put into a waiting state
until the request has been answered. Now, however, the server needs to store
the pid of the calling client in order to be able to distinguish clients if several of
them are waiting for the same server:

$$
\begin{array}{l}
(r[\mathtt{gen\_server{:}call}(spid, req)],\ pid,\ q) \parallel (\mathtt{gen\_server{:}ready}(state),\ spid,\ sq) \\
\quad \longrightarrow\ (r[\mathtt{gen\_server{:}wait}(spid)],\ pid,\ q) \parallel {} \\
\qquad\quad (\mathtt{gen\_server{:}busy}(\mathtt{handle\_call}(req, pid, state), pid),\ spid,\ sq)
\end{array}
\tag{4}
$$

³ Actually, in the definition of the semantics a two-layer scheme is employed which
separates expression-level from process-level steps. We will consider this distinction
in Sect. 4 in greater detail.

If the handle_call function yields a triple of the form {reply, answer, state}, 
then this answer is immediately returned to the waiting client, and the server 
changes into the idle state again: 



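$$
\begin{array}{l}
(r[\mathtt{gen\_server{:}wait}(spid)],\ pid,\ q) \parallel (\mathtt{gen\_server{:}busy}(\{\mathtt{reply}, answer, state\}, pid),\ spid,\ sq) \\
\quad \longrightarrow\ (r[answer],\ pid,\ q) \parallel (\mathtt{gen\_server{:}ready}(state),\ spid,\ sq)
\end{array}
\tag{5}
$$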



The fact that the process identifier is stored in the second argument in the server 
expression guarantees that the reply is received by the right waiting client. 

As can be seen in our locker example, the handle_call may also return a 
tuple of the form {noreply, state}. In this case the client process remains in the 
waiting state (and does not have to be considered therefore), whereas the server 
becomes idle again: 



$$
(\mathtt{gen\_server{:}busy}(\{\mathtt{noreply}, state\}, pid),\ spid,\ sq)
\longrightarrow
(\mathtt{gen\_server{:}ready}(state),\ spid,\ sq)
\tag{6}
$$



Note that when we have two clients, one suspended and one calling the server, 
then both clients are in the waiting state, but only one client can be activated by 
the return of the handle_call, viz. the process which called. The other process 
must be activated by explicitly using the gen_server: reply function: 



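$$
\begin{array}{l}
(r[\mathtt{gen\_server{:}wait}(spid)],\ pid,\ q) \parallel (r'[\mathtt{gen\_server{:}reply}(pid, answer)],\ spid,\ sq) \\
\quad \longrightarrow\ (r[answer],\ pid,\ q) \parallel (r'[\mathtt{true}],\ spid,\ sq)
\end{array}
\tag{7}
$$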



In this way semantical rules can be used to accurately describe the given 
example of the locker server. As can be seen, the asynchronous communication 
actions that are used in the gen_server module to implement synchronous mes- 
sage passing are abstractly represented by simple handshaking operations which 
do not consider the message queues of the client nor of the server. 
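To make the handshaking reading of the rules for a synchronous call concrete, the call, reply and noreply rules can be mimicked by a pure step function on a simplified configuration. This is a sketch under several assumptions and not part of EVT: the module name gs_rules, the term representation of configurations, and the eager evaluation of the callback are choices made here for illustration. Reduction contexts and mailboxes are left out entirely.

```erlang
-module(gs_rules).
-export([step/2, demo/0]).

%% A configuration is a pair {{Pid, Client}, Server} where
%%   Client :: {calling, Request} | waiting | {done, Answer}
%%   Server :: {ready, State} | {busy, Result, Pid}
%% Handler plays the role of the handle_call callback function.

%% Call rule: an idle server accepts the call and remembers the caller.
step({{Pid, {calling, Req}}, {ready, State}}, Handler) ->
    {{Pid, waiting}, {busy, Handler(Req, Pid, State), Pid}};
%% Reply rule: the answer is handed to the waiting caller.
step({{Pid, waiting}, {busy, {reply, Answer, State}, Pid}}, _Handler) ->
    {{Pid, {done, Answer}}, {ready, State}};
%% Noreply rule: the caller stays suspended, the server becomes idle.
step({{Pid, waiting}, {busy, {noreply, State}, Pid}}, _Handler) ->
    {{Pid, waiting}, {ready, State}}.

%% The request clause of the locker example as a pure function.
locker_request(request, From, []) ->
    {reply, ok, [From]};
locker_request(request, From, Requests) ->
    {noreply, Requests ++ [From]}.

demo() ->
    H = fun locker_request/3,
    %% The first client obtains the lock immediately.
    C1 = step({{c1, {calling, request}}, {ready, []}}, H),
    {{c1, waiting}, {busy, {reply, ok, [c1]}, c1}} = C1,
    {{c1, {done, ok}}, {ready, [c1]}} = step(C1, H),
    %% A second client is queued and remains in the waiting state.
    C2 = step({{c2, {calling, request}}, {ready, [c1]}}, H),
    {{c2, waiting}, {busy, {noreply, [c1, c2]}, c2}} = C2,
    {{c2, waiting}, {ready, [c1, c2]}} = step(C2, H),
    ok.
```

Evaluating gs_rules:demo() replays two lock requests: the first client is answered at once, whereas the second one stays suspended while the server returns to its idle state with both clients in its list.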

Asynchronous communication, not on the level of Erlang but on the level
of gen_server, is also supported. It is implemented via the gen_server:cast
and handle_cast functions. The gen_server:cast mechanism is formalized
similarly to the gen_server:call mechanism. The only difference is that the client
immediately proceeds without waiting for a server response. Having evaluated
the handle_cast function (which is indicated by a noreply result tuple), the
server just changes into the gen_server:ready state, modifying its local data
according to the result.

Apart from the reply and noreply values, handle_call (and handle_cast)
can instead return a result that indicates that the server has to terminate. If so, 
the terminate function in the callback module is invoked. In this situation, the 
response is stored on the server side until terminate has finished: 

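$$
\begin{array}{l}
(\mathtt{gen\_server{:}busy}(\{\mathtt{stop}, reason, answer, state\}, pid),\ spid,\ sq) \\
\quad \longrightarrow\ (\mathtt{gen\_server{:}down}(\mathtt{terminate}(reason, state), pid, answer),\ spid,\ sq)
\end{array}
\tag{8}
$$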

The terminate function is supposed to return the value ok and after that, the 
client is released and the server process is removed: 

$$
(r[\mathtt{gen\_server{:}wait}(spid)],\ pid,\ q) \parallel (\mathtt{gen\_server{:}down}(\mathtt{ok}, pid, answer),\ spid,\ sq)
\longrightarrow
(r[answer],\ pid,\ q)
\tag{9}
$$

Note that the callback functions such as init and handle_call are specified
in the callback module. We use the rules provided by the system to reason 
about their behaviour. Thus, abstraction by means of the semantical rules is 
only provided for the generic part of the server. 

4 Implementation 

As mentioned in the introduction, our gen_server verification approach has 
been implemented using the Erlang Verification Tool. Some minor additions had 
to be made to the tool itself, basically the recognition of the special constructs 
whose handling is left to the user. Thus, whenever a special construct occurs, the 
tool is aware of the fact that this is not a normal form, but that transitions may 
arise from the respective term. It leaves it to the user to check which transitions 
are possible. We, as a user, provided tactics to analyze the term and to apply 
the corresponding transition rule. These tactics are combinations of proof rules 
that are applied to the proof goal. Hence, we specified the transition rules given
in Sect. 3 as logical formulae.

Specifying the transition rules as logical formulae rather than integrating 
them as an extension of the original Erlang semantics into the EVT source code 
has two advantages: first of all, the reasoning within the tool remains sound 
with respect to the abstract gen_server semantics. By using an abstract model 
of a server one introduces a potential unsoundness. If the property that one 
wants to prove depends on the actual implementation of the server, one might 
be able to falsely prove it for the abstraction, whereas it does not hold for the 
real program (this point is discussed in the conclusions). By using logical rules 
for the transitions, one explicitly states in the assumptions how one expects the 
server to behave. Since the reasoning that involves these assumptions is based on
the (sound) EVT proof system, soundness is guaranteed under the premise that
the (low-level) implementation of the gen_server module behaves as described
by the (high-level) specification.

Second, one obtains a greater flexibility for experimenting. It is easier to 
change a logical expression and a tactic than to modify the implementation of 
EVT itself. Moreover, several different specifications of the server may exist at 
the same time, all using the same tool. 

In the following we give the logical representation of the transition rules
which describe the starting of a server process. As mentioned earlier, the formal
semantics of Erlang is given by a two-layer scheme [Fre], which is also used in
the EVT implementation. First, the Erlang expressions are provided with a
semantics on the expression level. The actions here are a functional computation
step, an output, the receiving of a message, and a call of a built-in function (like
spawn for process creation) with side effects on the process-level state. Second,
the transition behaviour of Erlang systems (that is, concurrent processes evalu- 
ating expressions in the context of a unique process identifier and a mailbox of 
incoming messages) is captured through a set of transition rules which lift the 
expression actions to the process level, and which describe the interleaving of 
concurrent actions. Here, possible process actions are computation steps, input, 
and output actions. 

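In this setting, (1) is decomposed into an expression-level rule which indicates
the spawning action, and a process-level rule which models the actual process
creation:

$$
\mathtt{spawn}(f, [v_1, \ldots, v_n])
\xrightarrow{\ \mathtt{spawn}(f, [v_1, \ldots, v_n]) \,\rightsquigarrow\, pid'\ }
pid'
$$

$$
\frac{e \xrightarrow{\ \mathtt{spawn}(f, [v_1, \ldots, v_n]) \,\rightsquigarrow\, pid'\ } e'
      \qquad pid' \neq pid}
     {(e,\ pid,\ q) \longrightarrow (e',\ pid,\ q) \parallel (f(v_1, \ldots, v_n),\ pid',\ \varepsilon)}
$$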

These transition rules in the tool automatically generate possible next states 
of an Erlang system when we want to prove that something holds in some suc- 
cessor state or in all successor states (diamond or box modality, respectively). 
Thus, given a spawn call in the program, we can reason about the state in which 
a new process is created, among the other possible next states that the tool 
computes for us. For the generic server behaviour we want to obtain a similar 
level of comfort. However, now we specify the transitions as logical formulae. 
For example, (2) gives rise to the following logical formula which describes the
starting of a server. 



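$$
\begin{array}{l}
\forall mod : \mathrm{Atom}.\ \forall arg : \mathrm{Value}.\ \forall opt : \mathrm{List}. \\[2pt]
\quad \Big( \forall spid : \mathrm{Pid}.\
  \mathtt{gen\_server{:}start}(mod, arg, opt)
  \xrightarrow{\ \mathtt{spawn}(\mathtt{init}, [arg]) \,\rightsquigarrow\, spid\ }
  \mathtt{gen\_server{:}wait}(spid) \Big) \\[4pt]
\quad \wedge\ \Big( \forall \alpha : \mathrm{Action}.\ \forall e : \mathrm{Expr}.\
  \mathtt{gen\_server{:}start}(mod, arg, opt) \xrightarrow{\ \alpha\ } e \ \Rightarrow \\[2pt]
\qquad\quad \exists spid : \mathrm{Pid}.\
  \alpha = \mathtt{spawn}(\mathtt{init}, [arg]) \rightsquigarrow spid
  \ \wedge\ e = \mathtt{gen\_server{:}wait}(spid) \Big)
\end{array}
$$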



By associating a spawn action with the gen_server:start function call, we
employ the process-level rule for spawn as given above for computing the next
state. The server starts evaluating the init function⁴, which may involve some
standard reasoning with the tool.

However, since we now specify the transition as a logical rule, we also need 
to state that no other transition is possible from this point, as expressed by 
the second subformula of the conjunction. The implementation of our generic 
server primitives in the tool is such that it presents all known actions at every 
point where one of these primitives may enable a transition. We manually have 
to prove that a certain transition is possible and that the others are not. The 
information needed for this proof is provided in the assumptions by means of a 
logical formula as presented above. 

Since we try to achieve a high degree of automation in our proofs, we provided
tactics for automatically showing that a certain action can take place and that
the others are not enabled. For example, when proving a diamond property for
an expression where a server is started, the tactics scan the assumptions of the
goal for a property named start_dia, and this property is used to automatically
prove that after the gen_server:start call a new process is created evaluating
the init function.

Rule (3) expresses that, after successful initialization, the server pid is
returned to the spawning process. Since it imposes conditions on the syntactic
structure of the two expressions computed by the processes, we have to give
four formulae in the EVT specification. The first two specify the local effects of
sending/receiving a message by/from the server, respectively.

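$$
\begin{array}{l}
\forall state : \mathrm{Value}. \\[2pt]
\quad \Big( \forall spid : \mathrm{Pid}.\ \forall pid : \mathrm{Pid}.\
  \{\mathtt{ok}, state\}
  \xrightarrow{\ \mathtt{server}(spid, pid) \,\rightsquigarrow\, \{\mathtt{ok}, spid\}\ }
  \mathtt{gen\_server{:}ready}(state) \Big) \\[4pt]
\quad \wedge\ \Big( \forall \alpha : \mathrm{Action}.\ \forall e : \mathrm{Expr}.\
  \{\mathtt{ok}, state\} \xrightarrow{\ \alpha\ } e \ \Rightarrow\
  \exists spid : \mathrm{Pid}.\ \exists pid : \mathrm{Pid}. \\[2pt]
\qquad\quad \alpha = \mathtt{server}(spid, pid) \rightsquigarrow \{\mathtt{ok}, spid\}
  \ \wedge\ e = \mathtt{gen\_server{:}ready}(state) \Big)
\end{array}
$$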



Furthermore, the expression-level synchronization actions have to be lifted
to the process level. We only give the formula for the server side; the dual
one (defining corresponding input actions of the form spid?sync(pid, v)) is of a
similar shape.

$$
\forall e : \mathrm{Expr}.\ \forall e' : \mathrm{Expr}.\ \forall spid : \mathrm{Pid}.\ \forall pid : \mathrm{Pid}.\ \forall v : \mathrm{Value}.\ \forall sq : \mathrm{Queue}.
$$

Now the standard synchronization mechanism implemented in EVT is used
to model the actual communication between the two sync events. Note that
both the client and the server pid are used to match the synchronization actions.

⁴ To be precise, the function mod:init is called, but the tool, at the moment, lacks
support for function calls in remote modules, so that we have to localize all calls.

Similar formulae are obtained for describing the synchronous
gen_server:call mechanism and the termination of a server process, and for the
remaining gen_server constructs. All single properties are collected in a big
conjunction named gen_server, which completely specifies the abstract behaviour
of gen_server systems, and which can be found in [AN00].

By implementing the transition rules as logical formulae we gain a flexible 
and easily adaptable extension of the tool. However, by the fact that we also 
need the ‘negative’ information that a certain transition is the only possible one, we
have some overhead in disproving extra generated subgoals for all other types 
of transitions. These subgoals are, however, relatively easy to reject and we 
implemented tactics to deal with them automatically. 

For example, it is possible to establish the mutual exclusion property of the 
locker protocol as defined in Sect. 2. Let us assume that the access_the_resource 
function, which implements the client’s activity in the critical section, is given 
by 

access_the_resource() -> 
    self() ! access. 

Hence the fact that within a given process system S a client with pid Client 
has entered its critical section can be expressed by the formula 

in_cs: erlangPid -> erlang_system -> prop = 
    \Client: erlangPid. 
    \S: erlang_system. 
    (S: <Client ! message(access)>tt). 

In the simplest case of a system with two clients using one locker, the mutual 
exclusion property can be characterized by the following safety formula, asserting 
for every reachable state of the system that not both clients are in their critical 
sections at the same time: 

mutex: erlangPid -> erlangPid -> erlang_system -> prop => 
    \Client1: erlangPid. 
    \Client2: erlangPid. 
    \S: erlang_system. 
    (   not (S: (in_cs Client1) /\ (in_cs Client2)) 
     /\ (S: [tau] (mutex Client1 Client2)) 
    ). 
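
To make the shape of this safety property concrete, the following Haskell sketch enumerates the reachable states of a deliberately simplified two-client locker model and checks the mutex condition in each of them. This is plain explicit-state exploration, not the proof-based reasoning that EVT performs, and the model itself (the types Phase and Sys and the function step) is an assumption made for illustration rather than the locker protocol of the paper.

```haskell
-- Hypothetical, simplified locker model: two clients and one lock.
data Phase = Idle | Waiting | Critical deriving (Eq, Show)

data Sys = Sys { c1 :: Phase, c2 :: Phase, locked :: Bool } deriving (Eq, Show)

-- All one-step successors of a system state.
step :: Sys -> [Sys]
step s = concat
  [ [ s { c1 = Waiting }                 | c1 s == Idle ]
  , [ s { c2 = Waiting }                 | c2 s == Idle ]
  , [ s { c1 = Critical, locked = True } | c1 s == Waiting, not (locked s) ]
  , [ s { c2 = Critical, locked = True } | c2 s == Waiting, not (locked s) ]
  , [ s { c1 = Idle, locked = False }    | c1 s == Critical ]
  , [ s { c2 = Idle, locked = False }    | c2 s == Critical ]
  ]

-- Worklist exploration of every state reachable from the initial one.
reachable :: Sys -> [Sys]
reachable s0 = go [s0] []
  where
    go [] seen = seen
    go (x : xs) seen
      | x `elem` seen = go xs seen
      | otherwise     = go (xs ++ step x) (x : seen)

-- The state-local part of the mutex formula: not both clients in the critical section.
mutex :: Sys -> Bool
mutex s = not (c1 s == Critical && c2 s == Critical)

main :: IO ()
main = print (all mutex (reachable (Sys Idle Idle False)))  -- prints True
```

The recursive conjunct of the mutex formula corresponds here to checking the state-local condition on every element of the reachable set.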

We have experimented with our implementation of the generic server 
specifications, applying it to some small examples like the one above. Indeed, 
the proofs are easier and shorter than without using this extension. We expect 
this benefit to become more significant on larger examples. 

5 Conclusions and Future Work 

In this article we presented an addition to the Erlang Verification Tool that 
enables us to reason on a higher level about the code implementing a client- 
server architecture using the generic server paradigm. The addition drastically 
reduces the amount of details one needs to consider when proving a property of 
an Erlang application that uses this client-server architecture, therefore resulting 
in shorter proofs. 

We formalized the operational semantics of the generic server behaviour sim- 
ilar to the way in which the operational semantics of more primitive Erlang 
functions is specified. For implementing this formalization we defined it in terms 
of logical formulae such that the only change in the tool that we required was 
the extension of the list of recognizable keywords by the gen_server-specific 
function names. It turned out that, in comparison to proofs based on the con- 
crete implementation, the logical formulae support a more effective reasoning 
about client-server systems. The efficiency of this reasoning has been increased 
by adding tactics that automatically prove subgoals about impossible alternative 
transitions. 

By specifying only the ‘essential’ behaviour of any reasonable gen_server 
implementation, we introduce a certain unsoundness in the proofs. That is, if 
by employing our approach we succeed in proving a certain property of an Erlang 
program that contains servers implemented by the gen_server module, then it 
depends on the property whether it really holds for the actual code. This, 
however, is a consequence of abstraction and not at all a problem in practice. 
First of all, we are not so much interested in proving correctness but rather in 
finding errors in the program. If we find an error with respect to this 
abstraction, it is most likely an error in the real code as well. Second, if the 
property is independent of the specification of the generic server, then the 
property should hold for the actual code. Since we abstract several server steps 
into one handshake operation in our semantics, the property should be at least 
τ-insensitive, i.e., its validity should not depend on the number of internal 
actions the system evaluates. The τ-insensitivity is a minimal requirement of 
the property, but it is insufficient, since several other issues are involved as 
well that make it very hard to formalize the exact independence criteria. For 
example, the property should be independent of: the number of messages in the 
mailbox of a server, the priority used to read messages from a mailbox, the 
debug and fault tolerance additions not specified by us, etc. 

Pragmatically, our concern is to provide a framework in which we can prove 
properties of the code in an abstract setting, where we use one abstraction for 
all possible properties. This abstraction is very close to the real implementation, 
but there will always exist properties for which it turns out to be too general. 
However, if we can prove a certain property about the abstraction, then we 
have increased the level of confidence in the code; if we find that a certain 
property does not hold by reasoning in this abstracted setting, then, most 
likely, this corresponds to an error in the real program. In that respect, our 
technique is therefore rather close to the model-checking approach. Here, 
however, we only need one abstraction for arbitrary properties and we do not 
have to abstract over unbounded data structures, dynamic process spawning, or 
dynamic network creation. Our approach obtains the abstraction automatically, 
but needs human assistance in non-trivial proofs, whereas the latter can often 
be handled automatically when using a model checker. 

In order to compare efforts, we experiment with using the same formalization 
of the behaviour of the generic server and its callback module with 
model-checking tools (such as Truth/SLC [LLNT99]). Since one possible source of 
infinite state spaces, the unbounded message queue of an Erlang process, is ab- 
stracted away in our model, this approach should potentially be more successful 
than for arbitrary Erlang programs. However, we can still apply those tools only 
to examples where the state space is finite; in particular, the number of processes 
must be bounded. 

The Open Telecom Platform contains several additional generic architectures, 
such as generic finite-state machines, generic event handlers, generic supervision 
trees, etc. Those concepts can be formalized and added to EVT along the same 
lines. A major part of the tactics that we have already written will directly be 
usable for those other generic concepts. In this way, with only little extra effort, 
the verification of even more realistic large applications can be simplified. 

Abstract. This paper presents the design and implementation of Glasgow 
distributed Haskell (GdH), a non-strict distributed functional language. 
The language is intended for constructing scalable, reliable distributed 
applications, and is Haskell’98 compliant, being a superset of 
both Concurrent Haskell and Glasgow parallel Haskell (GpH). 

GdH distributes both pure and impure threads across multiple Processing 
Elements (PEs). Each location is made explicit so that a program can use 
resources unique to a PE, and objects including threads can be created on 
a named PE. The location that uniquely owns a resource is identified by a 
method of a new Immobile type class. Impure threads communicate and 
synchronise explicitly to co-ordinate actions on the distributed state, but 
both pure and impure threads synchronise and communicate implicitly 
to share data. Limited support for fault tolerant programming is pro- 
vided by distributed exception handling. The language constructs are 
illustrated by example, and two demonstration programs give a flavour 
of GdH programming. 

Although many distributed functional languages have been designed, rel- 
atively few have robust implementations. The GdH implementation fuses 
and extends two mature implementation technologies: the GUM runtime 
system (RTS) of GpH and the RTS for Concurrent Haskell. The fused 
RTS is extended with a small number of primitives from which more 
sophisticated constructs can be built, and libraries are adapted to 
the distributed context. 



1 Introduction 

Distributed languages are used for a number of reasons. Many applications, par- 
ticularly those with multiple users, are most naturally structured as a collection 
of processes distributed over a number of machines, e.g. multi-user games, or 
software development environments. Applications distributed over a network of 
machines can be made more reliable because there is greater hardware and soft- 
ware redundancy: a failed hardware or software component can be replaced by 
another. Distributed architectures are more scalable than centralised 
architectures: additional resources can be added as system usage grows. 

We distinguish between large-scale and small-scale distribution. Large-scale 
distributed applications are supported by standard interfaces like CORBA 
[Sie97] or Microsoft DCOM [Mer96] and may have components written in multiple 
languages, supplied by several vendors, execute on a heterogeneous collection 
of platforms, and have elaborate failure mechanisms. In contrast, a small-scale 
distributed program entails components written in a single language, is 
typically constructed by a single vendor, and is often restricted to a 
homogeneous network of machines, with a simple model of failures. Small-scale 
distributed applications are typically constructed in a distributed programming 
language, e.g. Java with Remote Method Invocation (RMI) [DSMS98]. A distributed 
language allows the system to be developed in a single, homogeneous framework, 
and makes the distribution more transparent to the programmer. 

Functional languages potentially offer benefits for small-scale distributed 
programming, and several have been developed, e.g. Kali Scheme [CJK95], Facile 
Antigua [TLP+93], OZ [HVS97], Concurrent Clean [PV98], and Pict [PT97]. They 
allow high level distributed programming, e.g. capturing common patterns of 
distribution as higher-order functions. Functional languages provide type safety 
within the constraints of a sophisticated, e.g. higher-order and polymorphic, 
type system. Several benefits accrue if significant components of the application 
are pure, i.e. without side-effects. Such components are easy to reason about, 
e.g. to optimise, derive or prove properties. Pure components can be evaluated 
in arbitrary order, e.g. lazily or in parallel. It also may be easier to implement 
fault tolerance for pure computations because a failed computation can be safely 
restarted [TPL00]. 

We have designed and implemented a language based on (non-strict) dis- 
tributed graph reduction, in the anticipation of the following benefits that we 
seek to demonstrate in future work. Not all synchronisation and communica- 
tion between threads need be explicit, in particular the shared graph model 
means that a thread has implicit (read-only) access to variables shared with 
other threads. Moreover, all data transfer between threads is lazy and dynamic. 
The cost of laziness is an additional message from the recipient requesting the 
data, but there are several specific benefits. Lazy transfer is useful if part of a 
large (or infinite) data structure is to be exchanged. Logically the entire data 
structure is exchanged, but the receiving thread will only demand as much of the 
data structure as is needed. Lazy transfer automatically avoids the problem of 
a fast producer flooding a slow consumer’s memory. Dynamic transfer is useful 
if the amount of data to be sent is hard to determine a priori, or varies between 
program executions. 
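
The demand-driven behaviour described above can be seen in miniature within a single ordinary Haskell process. The sketch below involves no distribution at all; producer and consumer are hypothetical names, and the point is only that a logically infinite structure is handed over while just the demanded prefix is ever built.

```haskell
-- A logically infinite structure: the squares of all positive integers.
producer :: [Integer]
producer = map (^ 2) [1 ..]

-- The consumer demands only the first n elements; the rest is never evaluated.
consumer :: Int -> [Integer] -> [Integer]
consumer n = take n

main :: IO ()
main = print (consumer 5 producer)  -- prints [1,4,9,16,25]
```

In the distributed setting the same demand would additionally trigger request messages to the location holding the unevaluated remainder, which is the extra cost mentioned above.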

Section 2 outlines distributed architecture and language concepts. Section 3 
describes GpH and Concurrent Haskell. Section 4 presents the design of GdH, 
discussing the motivation. Section 5 describes the implementation of GdH. 
Section 6 presents and discusses two small GdH demonstration programs. Section 7 
compares our approach to some other distributed functional languages before 
Section 8 discusses future work and Section 9 concludes. 

2 Distributed Language Concepts 

To provide a framework in which to present our design, we define both distributed 
architecture and language concepts. Distributed languages execute on multiple 
processors connected by some network. A Processing Element is a processor 
with associated resources such as memory, disk, and screen. A Thread is an 
independent stream of execution within a program. In functional languages we 
distinguish between Pure threads that have no side-effects, e.g. perform no I/O, 
and Impure threads which may manipulate state, e.g. by performing I/O. A 
Process is a set of threads executing a program and sharing a common address 
space and resources such as files. 

A distributed language may or may not make architectural entities explicit, 
e.g. threads are explicit in many languages, but implicit in others. More often 
the language provides abstractions of the architecture level entities, e.g. naming 
a PE. There are several important concepts for distributed languages. 

Locations. A location is a set of resources, e.g. files, memory, etc. A location 
is an abstraction of a PE and its resources. A process may be considered to be a 
location with threads. A language is location independent if locations are implicit. 
A language is location aware if locations are explicit, enabling the programmer 
to utilise the resources of a location, e.g. forking a new thread into a named 
location. 

Communication and Synchronisation. Communication involves the exchange 
of data and synchronisation is the co-ordination of control between threads. The 
two are closely related as having one allows the implementation of the other. Non- 
determinism naturally results from communication when messages come from 
multiple threads. In languages with implicit communication/synchronisation 
threads typically communicate and synchronise using shared data, freeing the 
programmer from describing the communication/synchronisation. For example, 
Java threads may share a class of objects and communicate using synchronised 
methods. In languages with explicit communication/synchronisation, threads 
within a process typically communicate/synchronise using shared resources of 
the location, e.g. a semaphore. If the threads belong to different processes 
then communication/synchronisation may either address the thread or 
some other common location like a channel or port. 

Centralised/Decentralised. There is no reason why communicating threads 
must belong to the same program, and often large systems consist of multi- 
ple co-operating programs. Centralised languages are a single program, and this 
approach has the advantage that the program and the inter-thread communi- 
cation/synchronisation can be statically typed. Decentralised languages allow 
multiple programs to interact using a predefined protocol, e.g. a client-server 
model. This requires some language support to initialise communication. Such 
languages support dynamic systems that can be extended by adding PEs and 
new programs. However, the communication between such a dynamic set of 
programs cannot be statically typed. 

Fault Tolerance is the ability of a program to detect, recover and continue 
after encountering faults. Faults may either be internal to the program, e.g. 
divide by zero, or external, e.g. disk failure. 

Distributed languages support explicit communication/synchronisation 
between multiple threads on multiple PEs. A distributed language may be 
centralised or decentralised, and typically provides some support for fault 
tolerance. Functional languages often attempt to relieve the programmer of the 
burden of managing distribution, that is, they often provide implicit 
communication/synchronisation, and a degree of location independence. 



3 GpH and Concurrent Haskell 

Haskell’98 is sequential and programs execute as a single impure thread, termed 
an I/O Thread, that is executed by one PE. Two well-developed parallel 
extensions to sequential Haskell are GpH and Concurrent Haskell. GpH targets 
parallel transformational programming [Loo99], i.e. the program takes some 
input, performs some parallel calculation and produces an output. In contrast, 
Concurrent Haskell is aimed at reactive systems, i.e. the program constantly 
interacts with its environment, not necessarily terminating. 



3.1 GpH 

GpH [THM+96] is a small extension to Haskell’98 that executes on multiple 
PEs. The program still has one main I/O thread, but may introduce many pure 
threads to evaluate sub-parts of the heap in parallel. Pure Threads are advisory, 
i.e. they may or may not be created and scheduled depending on the parallel 
machine state. The par function is used to suggest that an expression may be 
evaluated in parallel with another by a new thread. Pure threads are anonymous 
in that they cannot be manipulated by the programmer once created. Parallelism 
can be further co-ordinated using seq to specify a sequence of evaluation, i.e. 
that one expression is evaluated before another. Higher-level co-ordination of 
parallel computations is provided by abstracting over par and seq in lazy, 
higher-order, polymorphic functions, called evaluation strategies [THLP98]. 
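
As a small illustration of the par and seq combinators, the sketch below sparks one subcomputation and evaluates another before combining them. It uses par as exported by the GHC.Conc module of present-day GHC, which is an assumption about the reader's toolchain and not the GpH system described here; fib and parSum are hypothetical names.

```haskell
import GHC.Conc (par)

-- A deliberately naive Fibonacci function, used only to create some work.
fib :: Int -> Integer
fib n
  | n < 2     = toInteger n
  | otherwise = fib (n - 1) + fib (n - 2)

-- Suggest that x may be evaluated by a new pure thread, evaluate y in the
-- current thread, and then combine the two results.
parSum :: Int -> Int -> Integer
parSum a b = x `par` (y `seq` (x + y))
  where
    x = fib a
    y = fib b

main :: IO ()
main = print (parSum 20 21)  -- prints 17711
```

Because the spark is advisory, the program computes the same value whether or not a second thread is ever created; only the evaluation schedule differs.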



3.2 Concurrent Haskell 

Concurrent Haskell [PHA+97] adds several extensions to Haskell’98. The program 
consists of one main I/O thread, but now the programmer has explicit control 
over the generation of more I/O threads and the communication between them. 

I/O Threads are created explicitly by the monadic command forkIO. A new 
I/O thread is mandatory, i.e. it is created and must be scheduled by the RTS. 
Once created, a thread can be addressed by its ThreadId to further manipulate 
its operation, e.g. to terminate it: 

forkIO     :: IO () -> IO ThreadId 
myThreadId :: IO ThreadId 
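
A minimal sketch of addressing a thread through its ThreadId follows. It relies on forkFinally and killThread from the Control.Concurrent module of present-day GHC, neither of which is named in the text above, and on an MVar (described in the next paragraph) to report how the thread ended; forkAndKill is a hypothetical name.

```haskell
import Control.Concurrent (forkFinally, killThread, threadDelay)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forever)

-- Fork an I/O thread that would otherwise run forever, terminate it through
-- its ThreadId, and report how it ended.
forkAndKill :: IO String
forkAndKill = do
  done <- newEmptyMVar
  tid  <- forkFinally
            (forever (threadDelay 100000))
            (\outcome -> putMVar done (either (const "killed") (const "finished") outcome))
  killThread tid      -- raises an asynchronous exception in the forked thread
  takeMVar done       -- blocks until the forked thread has reported its outcome

main :: IO ()
main = forkAndKill >>= putStrLn  -- prints killed
```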



Synchronisation and Communication. Implicit inter-thread synchronisation 
occurs within the shared heap, as threads block upon entering shared closures 
that are under evaluation by other threads. Explicit thread synchronisation and 
communication occurs within the monadic I/O system by the use of polymorphic 
semaphores - MVar. An MVar is created by newEmptyMVar, and is a container 
that has a state of either empty or full. Using takeMVar returns and empties the 
container contents if it was full, otherwise it blocks the thread that is attempting 
to take. A putMVar fills the container giving an error if it was already full. 
Multiple threads may share an MVar, in which case operations on it may be 
non-deterministic: 

newEmptyMVar :: IO (MVar a) 
takeMVar     :: MVar a -> IO a 
putMVar      :: MVar a -> a -> IO () 
isEmptyMVar  :: MVar a -> IO Bool 

These primitives can then be abstracted over to give buffers, FIFO channels, 
merging, etc. [PHA+97]. 

Fault Tolerance is supported by exceptions, which allow the flexible handling 
of exceptional or error situations by changing the flow of control within a thread. 
Synchronous exceptions occur within a thread’s execution, e.g. divide by zero. 
Asynchronous exceptions occur outside of the thread, somehow affecting it, e.g. 
an interrupt generated when the user hits <ctrl>-C: 

raiseInThread :: ThreadId -> Exception -> a 
throw         :: Exception -> a 
catchAllIO    :: IO a -> (Exception -> IO a) -> IO a 



4 Design of GdH 

GdH provides the following facilities to support distributed programming: 

— Location Awareness - through new language constructs enabling the 
manipulation of the specific resources at each location. For example to interact 
with the GUI of a specific user at a machine. 

— Explicit Synchronisation/Communication - enabling the co-ordination of 
distributed impure threads. For example to synchronise multiple users who 
are sharing a resource like a gameboard. 

— Location Independence - to maintain backward compatibility with 
Concurrent Haskell and GpH. This allows program behaviour to remain the same 
even when a program is distributed. 

— Fault Tolerance - has some support in GdH so that robust programs may 
be constructed. 

— A Centralised - approach is adopted, although future versions of GdH may be 
decentralised. A decentralised distributed Haskell is described in a later 
section. 

A more complete description of the design of GdH can be found in [Poi01]. 

4.1 Location Awareness 

A GdH program executes at a set of locations. Each location is labelled with 
a value of a new abstract data type PEId. A PEId cannot be constructed by 
the programmer, instead it must be obtained by querying the program state. 
Locations are part of the global state of the program, therefore GdH primitives 
to interrogate this state must be monadic commands rather than pure functions. 

A GdH program is centralised, with a distinguished main location which 
represents where the program was started. This location hosts the main 
I/O thread and provides the environment with stdin, stdout, etc. 

To support location awareness the program must be able to obtain four vital 
pieces of information: Where is this location, where are other locations, where 
is an object located, and what is located here? Then to utilise the resources and 
attributes of a location it is necessary to have some primitive to specify a remote 
operation. 

Where is this location? This is a request for the current location which is 
provided by the function myPEId: 

myPEId :: IO PEId 

Where are other locations? The set of all PEIds must be provided, which is 
returned conveniently as a list. In recognising that a GdH program will be a 
centralised program with a distinguished main location, the further decision is 
made that the head of the list returned by allPEId should always be the main 
location: 



allPEId :: IO [PEId] 

Where is an object located? Objects, i.e. items representing part of the world 
state such as: MVars, threads, files, etc, are made stationary at one location 
and only one copy of them ever exists. An object is constructed at a particular 
location because that is where the current thread is executing and often an object 
must be associated with a particular location because it is immobilised at that 
location, e.g. files and foreign objects. 

A new Haskell class, Immobile, groups stateful objects together, and has a 
method owningPE which returns the location of the specified object: 

class Immobile a where 
    owningPE :: a -> IO PEId 

instance Immobile PEId 
instance Immobile (MVar a) 
instance Immobile ThreadId 

What is located here? Each location has access to its local environment and 
existing Haskell interrogation commands, e.g. on files, environment variables, 
function appropriately. 

A Remote Operation is executed by a remote impure I/O thread to 
manipulate state. Remote operation also allows the creation of process networks 
spanning multiple locations. 

A new command, revalIO, is provided in the I/O monad. Calling revalIO 
job p causes the calling thread to block until the execution of job at the location 
p completes and returns a result, i.e. it has the effect of temporarily changing 
the location of the current thread. Conceptually revalIO is very similar to Java 
RMI: 

revalIO :: IO a -> PEId -> IO a 

The command revalIO represents the current thread temporarily changing 
location, and so preserves all location independent properties of that thread. 
This has an advantage for fault tolerance, as the error handling capabilities of 
the thread are location independent, i.e. an exception raised in a thread that is 
within a revalIO will propagate back until an exception handler is found. The 
exception handler may be within the job at a remote location, or in the original 
thread. 
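
The blocking and exception-propagation behaviour can be mimicked inside a single ordinary Haskell process. In the sketch below a "location" is merely a worker thread with a job queue, so nothing is actually distributed; SimPE, startPE and simRevalIO are hypothetical stand-ins, not GdH primitives, and the Status type is only modelled loosely on the description of the implementation in Section 5.

```haskell
import Control.Concurrent
  (Chan, forkIO, newChan, newEmptyMVar, putMVar, readChan, takeMVar, writeChan)
import Control.Exception (ErrorCall (..), SomeException, throwIO, try)
import Control.Monad (forever, join)

-- Hypothetical stand-in for a location: a worker thread serving a job queue.
newtype SimPE = SimPE (Chan (IO ()))

startPE :: IO SimPE
startPE = do
  jobs <- newChan
  _ <- forkIO (forever (join (readChan jobs)))  -- run queued jobs one by one
  return (SimPE jobs)

-- Outcome of a job: either a value or the exception that escaped it.
data Status a = Okay a | Fail SomeException

-- Run job "at" the given location, block until it completes, and re-raise
-- any exception that escaped the job in the calling thread.
simRevalIO :: IO a -> SimPE -> IO a
simRevalIO job (SimPE jobs) = do
  result <- newEmptyMVar
  writeChan jobs $ do
    r <- try job
    putMVar result (either Fail Okay r)
  status <- takeMVar result
  case status of
    Okay v -> return v
    Fail e -> throwIO e

main :: IO ()
main = do
  pe <- startPE
  v  <- simRevalIO (return (6 * 7 :: Int)) pe
  print v  -- prints 42
  r  <- try (simRevalIO (throwIO (ErrorCall "remote failure")) pe)
  case r of
    Left e   -> putStrLn ("handled in the original thread: " ++ show (e :: SomeException))
    Right () -> putStrLn "no exception"
```

A handler installed inside the job would catch the exception at the simulated remote location instead, which corresponds to the two handler placements mentioned above.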

Object placement can be accomplished by revalIO. This allows the creation 
of an object at a specific location, for example to create a distributed version of 
the Concurrent Haskell forkIO command that places a thread: 

rforkIO :: IO () -> PEId -> IO ThreadId 

An example of usage of these location aware functions is shown in Figure 1. 
The program shows that regardless of where a thread executes it consistently 
determines that a resource is at a particular location. The output of the 
program is a list of pairs showing thread location and resource location. Firstly 
the resource, an MVar, is created by newEmptyMVar and the list of available 
locations is obtained via allPEId. A function, work, is defined that uses myPEId 
to determine the thread location and then utilises owningPE to determine the 
location of the MVar; the result is returned as a pair of PEIds. The monadic 
map operation, mapM, is used to call this new function, work, for each location. 
The result of the monadic map is a list of pairs of PEIds that is shown by the 
putStrLn. 



main = do 
    ps <- allPEId 
    m  <- newEmptyMVar                      -- create the MVar on the main PE 
    let work = do 
            i <- myPEId 
            o <- owningPE m                 -- where's the MVar? 
            return (i,o) 
    rs <- mapM (\p -> revalIO work p) ps    -- map work across all PEs 
    putStrLn (show rs) 

-- Output: [(262215,262215),(524319,262215),(393218,262215)] 



Fig. 1. Using the location aware primitives 

4.2 Location Independence 

In a functional language the majority of the data is non-mutable, some data 
represents suspended computation and thus is mutable once, while the remainder 
of the data is objects that are mutable many times. When data is distributed 
over multiple locations then some form of synchronisation is necessary so that 
the data remains consistent. 

Suspended computations synchronise implicitly using the same approach as 
GpH and Concurrent Haskell, where the first thread that enters the suspension 
will perform the evaluation and other threads that enter then block until the 
evaluation completes. 

Location independence of objects requires that the language supports some 
means for manipulating an object regardless of its location. The design decision 
was made to rewrite the relevant libraries, i.e. for MVars, threads, etc., to 
encapsulate and hide the location dependent properties. To do this the mechanisms 
of location awareness can be used by allowing a remote operation via revalIO 
on the object once its location has been determined by owningPE. 



4.3 Fault Tolerance 

Exceptions are a useful construct for supporting fault tolerance. The synchronous 
and asynchronous exceptions supported by Concurrent Haskell are extended so 
that they operate in a location independent manner, that is, exceptions can be 
raised at one location and handled at a remote location. 

The only source of synchronous exceptions between locations in GdH occurs 
within the revalIO mechanism when a thread has temporarily migrated to 
another location as discussed earlier. Asynchronous exceptions between locations 
are possible when a raiseInThread function is applied to a thread which 
resides at a remote location. The semantics are extended for location independent 
operation, but no new language constructs are required. 

New exceptions may be raised in response to a failure of a PE or connection 
to a PE. The detection of this class of errors is difficult as often the system 
cannot distinguish between the loss of a PE, or the loss of a message, from a 
message being delayed. The handling of these errors is also problematic as the 
loss of any part of the virtual shared heap can result in dangling heap references 
and therefore corrupt data and code. Handling these errors is critical for the 
construction of robust systems and we have made an initial study [TPL00] but 
have not yet implemented our design. 



5 Implementation 

To implement GdH the following steps were necessary. The GpH and Concurrent 
Haskell runtime systems had to be merged into a new RTS. The new language 
primitives (myPEId, allPEId, owningPE, and revalIO) for location awareness 
need to be provided in a new Haskell module for the programmer, and require 
implementation in the merged RTS. Finally the existing libraries need to be 
extended to operate safely in a location independent manner. Each of these steps 
is discussed below, and more detailed information may be found in [Poi01]. 

                 Haskell’98   Concurrent Haskell    GpH           GdH 
PEs              One          One                   Many          Many 
Location         N/A          N/A                   Independent   Aware 
Centralisation   N/A          N/A                   Centralised   Centralised 
Threads          One          Many impure           Many pure     Many pure & impure 
Communication    N/A          Implicit & explicit   Implicit      Implicit & explicit 
Fault Tolerance  None         Exceptions            None          Exceptions 

Fig. 2. The relationship between Haskell’98, the extensions GpH and Concurrent 
Haskell, and the new extension GdH 

5.1 Merge Runtime Systems 

GdH fuses and extends the runtime systems used by the Glasgow Haskell 
Compiler (GHC) [PHH+93]. GHC supports not only Haskell’98 but also the 
extensions for Concurrent Haskell and GpH. The standard GHC RTS has extensions 
for concurrency and exceptions for use in Concurrent Haskell. GpH is implemented 
by using a second RTS, GUM, that uses PVM for low-level location communication, 
and adds support for parallelism and a virtual shared heap. 

The GdH RTS, Sticky GUM¹, is an extended fusion of GUM and the GHC 
RTS. The two original runtime systems share significant amounts of code, yet 
were not primarily designed to coexist. The overlap of the different RTS 
extensions and the languages they support can be seen in Figure 3. One of the 
major differences between the GHC RTS and GUM is what requires synchronisation: 
the GHC RTS requires additional synchronisation for the implementation of MVars, 
thread delays, and exceptions, whereas GUM requires additional synchronisation 
for the portions of the heap that are shared across multiple locations. 




¹ The RTS is sticky in that immobile objects adhere to a particular location. 

5.2 New Language Primitives 

The language design identified the following new language primitives: myPEId, 
allPEId, owningPE, and revalIO. These primitives form part of the new Haskell 
module Distributed shown in Figure 4. 



data PEId                                   -- abstract 
myPEId  :: IO PEId                          -- the current location 
allPEId :: IO [PEId]                        -- list of all locations 

class Immobile a where 
    owningPE :: a -> IO PEId                -- the location of an object 
    revalIO  :: IO b -> a -> IO b           -- remote evaluation to an object 
    revalIO job x = do 
        p <- owningPE x 
        doRevalIO job p 

instance Immobile PEId                      -- a location is immobile 



Fig. 4. Interface to the new Haskell module Distributed 



The current location is returned by myPEId. This result corresponds directly 
to the unique taskid from the underlying PVM system. It is straightforward to 
call a C function, through the Haskell foreign language interface via a callc, to 
fetch this value. 

The list of available locations is returned by allPEId. GUM already stores 
a table of all locations for use by the garbage collector and global addressing 
subsystems. The allPEId function accesses the RTS via a callc to build a list 
from this table. 

The Immobile class contains both owningPE and revalIO because practical 
use of the language quickly showed that one of the most common 
uses of the result of owningPE is to call revalIO immediately, thus sending 
a specific piece of work to where the resource is located. Therefore revalIO 
can now operate on any member of the Immobile class, automatically calling 
owningPE and then the underlying primitive doRevalIO, which uses values of 
type PEId only. A programmer who still wishes to reference the location of an 
object explicitly may do so, since PEId is a member of the Immobile 
class. 
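
The shape of the Immobile class can be tried out without a GdH runtime. The following is a minimal sketch under stated assumptions: PEId is modelled as a plain newtype, doRevalIO is a mock that runs the job in place instead of shipping it to another location, and the FileHandle type and the describe function are hypothetical names introduced only for illustration. 

```haskell
-- Local stand-in for the Distributed module of Fig. 4 (assumption: one
-- simulated machine; no PVM or GUM runtime is involved).
newtype PEId = PEId Int deriving (Eq, Show)

-- Mock primitive: the real doRevalIO ships the job to location p;
-- this stand-in simply runs it in place.
doRevalIO :: IO b -> PEId -> IO b
doRevalIO job _p = job

class Immobile a where
  owningPE :: a -> IO PEId
  -- Default method: look up the owner, then evaluate the job there.
  revalIO :: IO b -> a -> IO b
  revalIO job x = do
    p <- owningPE x
    doRevalIO job p

-- A location owns itself.
instance Immobile PEId where
  owningPE = return

-- Hypothetical immobile resource pinned to a location.
data FileHandle = FileHandle PEId String

instance Immobile FileHandle where
  owningPE (FileHandle p _) = return p

-- Send a job "to" the resource without naming its location explicitly.
describe :: FileHandle -> IO String
describe h@(FileHandle _ name) = revalIO (return ("opened " ++ name)) h
```

Because revalIO is a default method, a new immobile type only has to supply owningPE; this mirrors the design choice described above of bundling the two operations in one class. 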

Implementing Immobile Objects. Sticky GUM maintains information about 
which types are immobile and the Haskell programmer cannot alter this. When 
immobile objects are communicated between locations by the RTS, they are 
first converted into a new closure type, REMOTEREF. Hence there only ever exists 
one copy of the object, together with multiple REMOTEREF closures that refer back to it. 
The RTS assigns a unique global address to closures communicated between 
locations. The global address of a REMOTEREF includes the original PEId, 
which is accessed by owningPE via a _ccall_. 

Implementing Remote Evaluation. A new PVM message REVAL is defined 
that carries the information necessary for the generation of the remote thread. 
When the remote location receives a REVAL, it creates a mandatory thread and 
immediately begins executing it. Upon termination of this remote thread, the 
result is sent back to the original thread via the existing GUM RESUME message, 
as depicted in Figure 5. 




Fig. 5. The remote evaluation process 



The execution of a doRevalIO is outlined in Figure 6 and proceeds as follows. 
It tests whether the destination location is the same as the current location and, if so, 
performs the optimisation of doing the work locally. For the cases involving 
a remote location, an exception handler is installed around the piece of work, 
job, which is wrapped up in a constructed type, Status, so that a valid result 
value is always returned. The primitive unsafePerformIO is used to write the 
result to a single closure, result; this closure is where the synchronisation of 
waiting for the remote result takes place. 



data Status a = Okay a | Fail Exception 

doRevalIO :: IO a -> PEId -> IO a 
doRevalIO job p = do 
  i <- myPEId 
  if i==p 
    then job                          -- do it locally if you can. 
    else do 
      _ccall_ cRevalIO result p       -- send the work off. 
      case result of                  -- check the result. 
        Okay r -> return r 
        Fail e -> throw e 
  where 
    tryjob = do                       -- construct an 'Okay' result. 
      r <- job 
      return (Okay r) 
    caughtjob = catchAllIO tryjob (\e -> return (Fail e)) 
    result    = unsafePerformIO caughtjob 



Fig. 6. Blocking and error handling of doRevalIO 



The synchronisation mechanism used by Sticky GUM to block the original 
thread while the remote evaluation takes place is almost identical to that used 
in GUM. In Figure 6 the root of the work to be sent is a closure, so when the 
doRevalIO function makes a _ccall_ to create the REVAL message it also changes 
the closure into a blocking queue that has an initial state of blocked. A blocking 
queue is used to synchronise activities such as waiting on an MVar, or for 
evaluation to complete. 

Finally doRevalIO enters the closure in a case statement to check whether the result 
should raise an exception. By entering the closure it automatically blocks 
the local thread until the RESUME message arrives with the result. 
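
The error-handling half of this scheme, wrapping a job so that it always yields a value and re-raising on the calling side, can be illustrated in ordinary GHC Haskell. This is only a sketch under assumptions: SomeException stands in for the Exception type of the GHC version used in the paper, try replaces catchAllIO, and the names runCaught, unwrap and localReval are hypothetical. No blocking queue or message passing is modelled. 

```haskell
import Control.Exception (SomeException, throwIO, try)

-- Mirrors the Status type of Fig. 6 (assumption: SomeException replaces
-- the paper's Exception type).
data Status a = Okay a | Fail SomeException

-- Run a job and always produce a value: either its result or the
-- exception it raised. This is the part that would execute remotely.
runCaught :: IO a -> IO (Status a)
runCaught job = do
  r <- try job
  return (either Fail Okay r)

-- Unpack a Status on the calling side, re-raising a captured exception.
unwrap :: Status a -> IO a
unwrap (Okay r) = return r
unwrap (Fail e) = throwIO e

-- Hypothetical local stand-in for a remote call: catch on the "remote"
-- side, re-raise on the "local" side.
localReval :: IO a -> IO a
localReval job = runCaught job >>= unwrap
```

Keeping the exception inside a Status value means the remote side never fails to answer, so the waiting thread is always resumed, either with a result or with the exception to throw. 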

5.3 Extend Existing Libraries 

Many of the libraries need to be made location aware. The design identified that 
many useful constructs could be built from the four distribution primitives and 
much of the implementation to do so required rewriting existing libraries rather 
than providing totally new constructs. There are two types of extensions needed, 
the first to provide a location independent means for accessing objects, and the 
second to provide a method for specifying object placement. 

To illustrate the process we provide an example for threads. First ThreadId 
is made an instance of the Immobile class so that location independent access 
routines can be defined, i.e. for killThread and raiseInThread. Then new 
object placement functions can be added, i.e. rforkIO. Some functions may 
require no change, for example forkIO, Eq, and myThreadId: 

instance Immobile ThreadId 

killThread th       = revalIO (Concurrent.killThread th) th 
raiseInThread th ex = revalIO (Concurrent.raiseInThread th ex) th 
rforkIO job p       = revalIO (forkIO job) p 

6 Demonstrators 

We present two small demonstration programs to illustrate the new constructs 
provided by GdH; these programs make almost exclusive use of explicit com- 
munication. In contrast to these examples we intend GdH to be used for larger 
applications where the majority of the communication can be implicit, e.g. a 
game where multiple players can implicitly share a large environment, 
dynamically fetching the parts of the environment on demand. 

6.1 Ping 

A very simple distributed program is our UNIX ping-like utility that gives an 
indication of location-to-location communication cost by timing the use of revalIO 
to perform a simple operation remotely. 

The code in Figure 7 obtains the list of available locations using allPEId. 
Then mapM is used to map loop across the locations. Within loop we use timeit 
to measure how long the revalIO remote operation takes on that location. 
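
The paper does not show the definition of timeit. A minimal sketch of such a helper is given here, assuming a monotonic clock from GHC's base library and a result in whole milliseconds; the ping function is a hypothetical local stand-in for the loop body that runs its action in place rather than through revalIO. 

```haskell
import GHC.Clock (getMonotonicTimeNSec)

-- Hypothetical timeit: run an action and return its result together with
-- the elapsed wall-clock time in whole milliseconds.
timeit :: IO a -> IO (a, Integer)
timeit act = do
  t0 <- getMonotonicTimeNSec
  r  <- act
  t1 <- getMonotonicTimeNSec
  return (r, toInteger (t1 - t0) `div` 1000000)

-- Stand-in for the loop body of the ping program. The action `remote`
-- is executed locally because no GdH runtime is assumed.
ping :: Show pe => IO String -> pe -> IO String
ping remote pe = do
  (name, ms) <- timeit remote
  return ("Pinging " ++ show pe ++ " ... at " ++ name
          ++ " time=" ++ show ms ++ "ms")
```

With millisecond resolution a purely local call is expected to report a time close to zero. 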

The results are given for a group of Linux x86 PCs on our local network. The 
first result is approximately zero since the work is being executed locally. GdH 
uses PVM for communication and the times returned are comparable to the PVM 
timings program, which returned round trip times of 1.1 - 2.7ms. 



main = do 
  pes <- allPEId 
  putStrLn ("PEs = "++show pes) 
  mapM loop pes 
  where 
    loop pe = do 
      putStr ("Pinging "++show pe++" ... ") 
      (name, ms) <- timeit (revalIO remote pe) 
      putStrLn ("at "++name++" time="++show ms++"ms") 
    remote = getEnv "HOST" 

-- Output: PEs = [262344,524389,786442,1048586,1310730] 
--         Pinging 262344 ... at ushas time=0ms 
--         Pinging 524389 ... at bartok time=3ms 
--         Pinging 786442 ... at brahms time=3ms 
--         Pinging 1048586 ... at selu time=2ms 
--         Pinging 1310730 ... at kama time=2ms 



Fig. 7. GdH Ping program 



6.2 Co-operative Editor 

A more sophisticated distributed program is a co-operative text editor that 
supports multiple text editor windows on different machines, allowing users to 
communicate through them and share files. Such an editor allows the sharing of files 
that are only accessible locally on a particular machine. An interface library in 
Haskell for the Tcl/Tk libraries, TclHaskell [SD99], is used to create multiple 
instances of Tcl/Tk running a simple text editor (ted). A new menu within the 
editors is used to manage the distributed interaction. The menu enables an 
editor to send its current buffer contents to all other editors, or to fetch 
messages sent to it from any other editor. 

The communication mechanism is a FIFO channel implemented via multiple 
MVars as provided in the standard libraries of Concurrent Haskell. The channels 
are used for two purposes: 

Termination Control - There is one global channel, named fin in Figure 8, 
and upon GUI quit or failure each GUI sends a message along this channel, 
which is then used by the startup thread to detect when every GUI thread 
in the program has terminated. The auxiliary functions newWait, rforkWait, 
and untilWait co-ordinate this behaviour, where rforkWait encapsulates the 
rforkIO and additional exception handling. 

Data Exchange - Each GUI has its own channel, which is a FIFO buffer. By 
reading or writing to each channel it is possible to co-ordinate the data exchange 
between editors. Note, however, the data is transferred lazily, i.e. only when the 
receiving editor displays it. 
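
The termination-control channel can be approximated with the standard Concurrent Haskell libraries. The sketch below is an assumption-laden illustration, not the paper's code: the definitions of newWait and untilWait are guesses at their behaviour, forkWait uses forkIO in place of rforkIO because it runs on one machine, and untilWait takes an explicit thread count, which the paper's version does not. 

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Exception (SomeException, try)
import Control.Monad (replicateM_)

-- Hypothetical termination channel: one () message per finished thread.
type Wait = Chan ()

newWait :: IO Wait
newWait = newChan

-- Fork a job and report on the channel when it ends, whether it
-- returned normally or raised an exception.
forkWait :: Wait -> IO () -> IO ()
forkWait fin job = do
  _ <- forkIO $ do
         _ <- try job :: IO (Either SomeException ())
         writeChan fin ()
  return ()

-- Block until n threads have reported termination.
untilWait :: Wait -> Int -> IO ()
untilWait fin n = replicateM_ n (readChan fin)
```

Catching the exception before writing to the channel is what lets a failed GUI thread still be counted as terminated. 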

Initialisation is shown in Figure 8, where all the channels, the list ports, are 
generated by the first mapM. It uses revalIO to ensure that the channels are created 
separately, one on each location, for efficiency. Later the pick function chooses the 
appropriate channels for each editor instance. The second mapM is used to create 
separate instances of the TclHaskell GUI running the editor (ted) on each 
location. Finally the termination control mechanism of untilWait causes the main 
thread to wait until all the GUIs have finished. 



main = do 
  pes <- allPEId 
  putStr ("PEs = "++(show pes)++"\n") 
  fin <- newWait 
  ports <- mapM (\p -> revalIO newChan p) pes 
  let remote p = do 
        let startGUI = do 
              name <- getEnv "HOST" 
              primPutEnv ("DISPLAY="++name++":0.0") 
              start $ (ted (pick ports p)) 
        rforkWait fin startGUI p 
  mapM remote pes 
  untilWait fin 



Fig. 8. Initialisation of the editor 



buffer_menu :: Context -> GUI () 
buffer_menu ctx@(ctx w mp e rf) = 
  do m <- menu w [tearoff False]                       -- define the menu layout 
     cascade mp m [wgt_label "Buffer"] 
     mbutton m [wgt_label "Fetch", command doFetch] 
     mbutton m [wgt_label "Send" , command doSend ] 
     mbutton m [wgt_label "List" , command doList ] 
  where 
    doFetch = 
      do FES c _ <- readstate rf                       -- get input channel 
         empty <- proc $ (isEmptyChan c) 
         if empty 
           then return () 
           else do (f,s) <- proc $ (readChan c) 
                   resetEdit e s                       -- show text in the editor 
                   change_fn ctx f                     -- show source as filename 

    doSend = 
      do FES _ cs fn _ <- readstate rf                 -- get output chans & filename 
         s <- getEdit e                                -- get text in editor 
         host <- proc $ (getEnv "HOST") 
         proc $ (mapM (\c -> writeChan c ((fn++" (from@"++host++")"), s)) cs) 

    doList = 
      do FES c _ <- readstate rf                       -- get input channel 
         vs <- proc $ (snapChan c) 
         resetEdit e (unlines (map (\(f,_) -> f) vs))  -- show text in the editor 
         change_fn ctx "(message list)"                -- show "(m..." as filename 



Fig. 9. Buffer communication for the editor 



Buffer Communication is handled within the new menu in the editor. The 
menu’s TclHaskell code is shown in Figure 9. It defines a new menu named 
Buffer with three options, Fetch, Send, and List, and associates appropriate 
functions with each option. The function doFetch handles the receiving of the 
data, with isEmptyChan being used to check whether any message exists; if so, 
readChan extracts the first message. Each message consists of two strings: the 
name of the buffer and the buffer contents. The second function, doSend, appends 
the name of the host machine to the file name and then uses writeChan mapped 
across all the other channels to send its buffer contents to all the other editors. 
The final function, doList, uses snapChan to take a snapshot of the entire channel 
contents and then maps a function across it to list all the names of the messages 
in the buffer. 

In the screenshot in Figure 10 the bottom two windows are instances of the 
editor, redirected via X to the same host; the other windows show PVM running 
and the console output. 

Fig. 10. Editor screenshot 



7 Related Work 

Conventional distributed languages like Java or C-Split [DSMS98] provide 
high-level support for communication and synchronisation, e.g. remote procedure calls 
and synchronised methods. They typically have explicit, static task and data 
partitioning, although some recent languages now support dynamic task and data 
placement, for example the object-orientated, functional, constraint-based 
language Oz [HVS97]. Every value communicated between processes is sent 
explicitly, and in many languages must be fully evaluated prior to transmission. Hence 
values transmitted must be first order, i.e. functions and infinite data structures 
cannot be transmitted. Programs are non-deterministic and the programmer is 
responsible for avoiding problems of deadlock and starvation. 

Like other distributed functional languages, GdH supports a more dynamic 
approach to distribution than conventional distributed languages. For example 
in GdH pure and I/O threads can be dynamically created, with pure threads 
dynamically allocated to PEs. Also data is dynamically communicated between 
locations on demand and it allows the transmission of higher order values. Like 
other distributed functional languages, GdH supports more implicit distribution, 
for example threads communicate and synchronise implicitly on shared data 
values. Using I/O threads a GdH programmer can construct programs 
as expressive as those written in conventional distributed languages, including programs that are 
non-deterministic or that deadlock. 

While many distributed languages opt for the flexibility of a decentralised 
approach, GdH’s centralised approach makes verification easier: all of the inter- 
acting threads are part of a single program and both threads and communication 
can be statically typed. In the former approach, there is no common analysis of 
the co-operating programs. 

There have been many distributed functional language designs and 
implementations, e.g. Facile Antigua [TTjP+ 98], Goffin Distributed Haskell [CGK98], 
Oz [HVS97], Kali Scheme [CJK95], Concurrent Clean [PVM], and Pict [FTW]. 
Some are only paper designs, and others have only been supported by short-lived 
implementations. There are relatively few robust long-lived implementations, 
and we discuss some of the more recent ones that are most closely related to GdH. 

ERLANG represents arguably one of the most successful distributed 
functional languages so far. It was developed specifically in the telecommunications 
industry for writing concurrent, soft real-time, distributed fault tolerant 
systems [Wik94] and has had considerable success in this area. In contrast to GdH, 
Erlang systems are decentralised. As a simple strict impure functional 
language, Erlang omits many features of modern functional languages. Compared 
to GdH it omits currying, higher-order functions and lazy evaluation. More 
importantly, coming from a logic programming background it is untyped, which 
allows many programmer errors to go unnoticed at compile time. Erlang 
supports a number of extremely useful features, especially useful for large complex 
applications, which are not available in GdH. These include hot loading of new 
code into running applications, explicit time manipulation to support soft real 
time systems, and message authentication. 

Haskell with Ports [HN00] is a new library for Haskell that adds the benefits 
of Erlang style communication using ports, thus allowing inter-process 
communication and fault tolerance. Use of the library allows decentralised systems to 
be constructed that communicate with each other over ports using a predefined 
protocol with dynamic typing of the communication. Unlike GdH it only allows 
explicit communication and only first order values (including ports) to be 
transmitted through the ports. Communication of higher-order values is mentioned 
as possible future work. 

Brisk [Spi99] is a derivative of Haskell (currently partially implemented) 
which makes use of lazy evaluation to give deterministic concurrency in a 
multiple demand driven approach. A deterministic form of communications based on 
merging with hierarchical timestamps is also introduced to extend the 
expressiveness of the basic deterministic concurrency. Other useful features include the 
sharing of binary code between machines. Compared to GdH, Brisk uses a more 
powerful pure functional approach without resorting to monadic style I/O, yet 
is closest in terms of the implicit and lazy communication. Brisk’s deterministic 
concurrency model is much more restrictive and prevents the expression of many 
inherently non-deterministic programs, e.g. the dining philosophers problem. 

8 Future Work 

Some minor work is still required to make the GdH implementation more robust. 
Once this is complete we plan to consider the following issues: 

Use and Evaluation of GdH - in comparison to conventional distributed 
languages like Java. In particular we have constructed a GdH version of an existing 
distributed factory simulation [Kit92], and are constructing a larger application 
- a multiuser game with map navigation that utilises implicit communication. 

Fault Tolerance - The RTS can distinguish between pure and impure 
computations: impure computations must be recovered using conventional 
exception-based techniques, but the RTS could attempt implicit recovery of pure 
computations [TPL00]. 

Decentralised - systems would require GdH to provide connection/disconnec- 
tion language constructs. Sticky GUM would require further extensions to allow 
dynamic typing of communication and a robust virtual shared heap that allows 
PEs and their heap to disconnect. 

9 Discussion 

The design objectives and concepts underlying the distributed functional lan- 
guage GdH have been presented. GdH provides explicit threads, with explicit 
mapping onto PEs. Communication between threads is achieved via virtual 
shared memory, implemented as a shared heap in our graph reduction machine. 
Special features of our language are the implicit communication of, and syn- 
chronisation on, shared data, and the lazy dynamic communication of the data 
between locations. 

The implementation of GdH combines two mature runtime systems, and adds 
a small set of new primitives. The main modifications necessary to support the 
requirements of a distributed language affect remote thread creation, and the 
treatment of immobile objects. By basing our system on GHC we utilise mature 
compiler technology including: sophisticated sequential code optimisations, a 
foreign language interface, libraries for graphical user interfaces, etc. 

We plan to evaluate GdH on larger examples and to make it freely and 
publicly available as part of the GHC distribution. 



Abstract. Algorithmic skeletons define general patterns of computation 
which are useful for exposing the computational structure of a program. 
Being general structures they qualify as a target for parallelisation, 
which is most often carried out by providing specialised, non-portable, 
low-level parallel implementations (architectural skeletons) of each 
algorithmic skeleton for different platforms. In the paper we introduce an 
intermediate layer of implementation skeletons for the parallel functional 
language Eden. These are portable high-level skeletons which simplify 
the design of parallel programs substantially. Runtime experiments on a 
network of workstations and on a Beowulf cluster have shown that even 
on such high-latency parallel platforms good speedups can be obtained. 

1 Introduction 

The inherent parallelism of functional programs often leads to many fine-grained 
tasks, while, due to costly communication and fast processors, conventional par- 
allel machines depend on coarse-grained tasks to deliver speedups. A parallel 
functional language has to bridge that gap, either in an implicit or explicit way. 
Ideally, the programmer should not be bothered with the low-level details of 
parallel execution. But often speedups can only be achieved when one is able 
to control the costs introduced by processes, communication, and data distribu- 
tion. Therefore it is necessary to find a level of abstraction which gives program- 
mers enough control to implement their parallel algorithms efficiently (including 
granularity issues) and at the same time frees them from the low-level details of 
process management. 

The parallel functional language Eden [3,4] provides such a level of 
abstraction. Eden is explicit about processes and their input and output data, but 
abstracts from the communication of data between processes and the 
synchronisation required. The Eden implementation is a freely available distributed 
implementation that maps process structures directly to the underlying 
architecture, giving the programmer more control over work and data distribution. 

In this paper we show methods to support the construction of efficient par- 
allel programs in Eden. This is a first step towards a programming methodology 
for Eden programmers which shows how to use the language effectively. Our 
approach is based on the concept of skeletons [5]. A skeleton is a high-level 
parallel programming construct (often a polymorphic higher-order function) which 
represents a common parallel computation pattern. Associated with a skeleton 
are usually specialised, efficient low-level implementations for various parallel 
machines called architectural skeletons. We propose to consider an intermediate 
layer of implementation skeletons between the high-level algorithmic skeletons 
and the low-level architectural skeletons. As a case study we introduce various 
parallel Eden implementation skeletons for the well-known algorithmic skeleton 
map. Being able to define the abstract specification of a skeleton as well as its 
parallel implementations in the same declarative language gives a solid basis 
for proving the correctness and other properties of skeleton implementations. 
We present runtime results of several realistic benchmark programs which have 
been parallelised using the implementation skeletons on a network of worksta- 
tions and on a Beowulf cluster. These architectures are examples of extremely 
high-latency, low-bandwidth parallel machines, favouring coarse-grained compu- 
tation with minimal communication. Our test cases have been a ray tracer, a 
computer algebra algorithm for solving linear equation systems and the calcu- 
lation of Mandelbrot sets. Although it is difficult to achieve speedups in the 
presence of high communication costs, Eden provides enough control to cope 
with this obstacle. 

The main contributions of this paper are the introduction of parallel imple- 
mentation skeletons which enable parallel programming with low effort in Eden 
(and in every other language in which such skeletons can be expressed) and the 
presentation of runtime results which reveal reasonable absolute speedups on 
high-latency systems for the non-trivial benchmark programs mentioned above. 

The next section introduces the key features of Eden which are necessary to 
understand the implementation skeletons defined in Section 3. Section 4 presents 
the results of our experiments. The paper finishes with a discussion of related 
work and conclusions. 

2 Eden 

Eden extends the lazy functional language Haskell with syntactic 
constructs for explicitly defining processes [4]. Eden’s process model provides direct 
control over process granularity, data distribution and communication topology. 

Defining Processes. An Eden process maps inputs in_1, ..., in_m to outputs 
out_1, ..., out_n. Its behaviour is specified by a process abstraction, with Process 
being a newly defined type constructor: 

p :: Process (it_1, ..., it_m) (ot_1, ..., ot_n) 
p = process (in_1, ..., in_m) -> (out_1, ..., out_n) 
    where equation_1 ... equation_r 

Example 1. A function f :: a -> b can be embedded into a process 
abstraction via the function mkProc: 

mkProc :: (a -> b) -> Process a b 
mkProc f = process x -> f x 

The function argument will be communicated to a process generated using 
mkProc f. ◁ 

Processes are dynamically created using process instantiations. A process instan- 
tiation provides a process abstraction with actual input parameters. Its evalu- 
ation leads to the creation of a process together with its interconnecting com- 
munication channels. Processes communicate via unidirectional channels which 
connect one writer to exactly one reader. We use the operator 

(#) :: (Transmissible a, Transmissible b) => 
       Process a b -> a -> b 

for process instantiation. The context Transmissible a ensures that functions 
for the transmission of values of type a are available. In the equation 

(out_1, ..., out_n) = p # (inexp_1, ..., inexp_m) 

a process abstraction p is instantiated with a tuple of input expressions, yielding 
a tuple of outputs. 

Example 2. The higher-order function map 

map :: (a -> b) -> [a] -> [b] 
map f xs = [f x | x <- xs] 

can be lifted to a parallel setting by using a process abstraction instead of a 
function as first parameter. This process abstraction is instantiated with each 
element of the list of input data, yielding a list of results: 

parMap :: (Transmissible a, Transmissible b) => 
          Process a b -> [a] -> [b] 
parMap p xs = [p # x | x <- xs] 

The function parMap can be combined with mkProc to construct a parallel map 
which has (apart from the type context) the same type as map: 

map_par :: (Transmissible a, Transmissible b) => 
           (a -> b) -> [a] -> [b] 
map_par = parMap . mkProc 

Preserving the original type of map makes it easy to parallelise sequential 
programs by replacing appropriate occurrences of map by map_par. ◁ 

A predefined nondeterministic process merge is provided for many-to-one com- 
munication in process systems. The merge process takes a list of input streams 
and merges the incoming values in the order in which they arrive into a single 
output stream. 
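
The arrival-order behaviour of merge can be imitated in Concurrent Haskell. The following sketch is not Eden's merge process: it is a hypothetical IO function, mergeIO, that assumes finite input lists, uses one writer thread per input and a shared channel, and returns the values in whatever order the writers happened to deliver them. 

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (newChan, readChan, writeChan)
import Control.Monad (forM_, replicateM)

-- Hypothetical many-to-one merge for finite inputs: every input list
-- gets its own writer thread, and the reader collects values in the
-- order in which they arrive on the shared channel.
mergeIO :: [[a]] -> IO [a]
mergeIO xss = do
  out <- newChan
  forM_ xss $ \xs -> forkIO (mapM_ (writeChan out) xs)
  replicateM (sum (map length xss)) (readChan out)
```

The result always contains exactly the input values, and the relative order within each input list is preserved, but the interleaving between lists may differ from run to run. 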
Evaluating Processes. Each Eden process evaluates its output expressions to 
normal form and sends the results on the corresponding communication channels. 
Lists are transmitted as streams. This means that in two respects additional 
demand is introduced in favour of parallelism: 

1. Evaluation to normal form is used for process outputs instead of evaluation 
to weak head normal form (WHNF). 

2. Communication is not demand-driven. Values are sent to the receiver process 
without the latter having to request them. In general terms, Eden 
employs pushing instead of pulling of information. 

To achieve an early instantiation of a process system it is necessary to impose an 
appropriate demand on the expression describing the process system [Idj . This 
can be achieved by using evaluation strategies [21] . A parMap version which ea- 
gerly creates all its processes when evaluated to WHNF uses the strategy spine 
to force the evaluation of the result list’s spine: 

parMap :: (Transmissible a, Transmissible b) => 
          Process a b -> [a] -> [b] 
parMap p xs = [p # x | x <- xs] `using` spine 
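
The effect of the spine strategy can be reproduced in plain Haskell. The definitions below follow the usual formulation of evaluation strategies as functions returning unit; they are given here as an assumption about what using and spine look like, and seqMapSpine is a hypothetical sequential stand-in in which ordinary function application replaces the process instantiation p # x. 

```haskell
-- A strategy evaluates (part of) its argument and returns unit.
type Strategy a = a -> ()

-- Apply a strategy to a value before returning the value itself.
using :: a -> Strategy a -> a
using x s = s x `seq` x

-- Walk all list cells without touching the elements.
spine :: Strategy [a]
spine []       = ()
spine (_ : xs) = spine xs

-- Sequential stand-in for parMap: forcing the result to WHNF forces the
-- whole spine, so every list cell (in Eden: every process
-- instantiation) exists before the first element is consumed.
seqMapSpine :: (a -> b) -> [a] -> [b]
seqMapSpine f xs = [f x | x <- xs] `using` spine
```

Since spine ignores the elements themselves, the strategy creates demand for the structure of the result list only, not for the individual results. 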



3 Implementation Skeletons 

Often a given algorithm can be parallelised in many ways. The most suitable 
parallelisation depends heavily on the kind of inherent parallelism (granularity) 
and the characteristics of the parallel machine executing the program. In this 
section we identify three alternative parallelisations of the higher-order function 
map. The parallelisations are defined in a machine-independent way as 
implementation skeletons, a new concept we introduce in the following. 

An algorithmic skeleton is a higher-order scheme that abstracts from the 
details of a (parallel) algorithm and defines the general pattern of a (parallel) 
computation [5]. Well-known skeletons are e.g. the divide-and-conquer scheme 
and the data-parallel process farm. In skeletal programming one often distin- 
guishes between three aspects of skeletons [0] : 

— the higher-order function defining a general computation scheme with inher- 
ent parallelism (the proper algorithmic skeleton) 

— different parallel implementations for different target architectures (archi- 
tectural skeletons) 

— a cost (performance) model to estimate the execution time. 

In this paper we concentrate on the first two aspects. In our context differ- 
ent parallel implementations are not related to different architectures. Therefore 
the notion of architectural skeleton is inappropriate and we introduce the term 
implementation skeleton instead. An implementation skeleton is an architecture- 
independent scheme that describes a parallel implementation of an algorithmic 
skeleton on a higher level of abstraction than an architectural skeleton, the latter 
usually being optimized for a special target architecture. Often parallel imple- 
mentations of an algorithmic skeleton are only described informally and hidden 
from the user of the skeleton (see e.g. EE]). In Eden, it is possible to describe the 
algorithmic skeleton and its parallel implementations in the same language con- 
text. This constitutes a good basis for formal reasoning and correctness proofs, 
although this is not further elaborated in this paper. 

In most parallel implementations of map the input list is seen as a task queue 
that can be processed using several processor elements (PEs). We have already 
seen a straightforward parallelisation of map, parMap, in Section 2. parMap 
creates a new process for each task. This simple approach is not always well suited, 
especially in the presence of fine-grained or irregular tasks. Alternative paral- 
lelisations of map found in the literature use a fixed number of worker processes 
which process a subset of tasks each. They differ in the way the tasks are dis- 
tributed among the worker processes: 

— static task distribution 

farm: an equal number of tasks is sent to each worker 

direct mapping: each worker computes the task list, and selects part of it 

— dynamic task distribution (distribution on demand) 

workpool: send a new task to a worker only when it has finished a task. 

A visualisation of the parallelisations parMap, farm and direct mapping is given 
in Fig. 1. These schemes can be used for regular problems with equally complex 
subtasks, while the workpool scheme is better suited for problems where the task 
granularity is irregular and load balancing problems can occur. 



[Figure: three panels labelled Parallel Map, Farm, and Direct Mapping] 

Fig. 1. Implementation skeletons for problems with regular task granularity 



In the following we define implementation skeletons for these different methods of 
parallelising the map function. Note that the simple skeleton map_par has already 
been defined in Section 2. The original type interface of map is always retained, 
although this is not essential for implementation skeletons. It has, however, the 
advantage that the parallelisation of calls to map simply consists of replacing map 
by one of the implementation skeletons. 



3.1 Static Task Distribution: Farm 

While the simple parMap defined in Section 2 creates a new process for each 
task, the farm scheme creates only a fixed number np of processes and assigns 
every process more than one task. It uses parameter functions distribute and 
combine to distribute the tasks to the processes and to combine the results. The 
following Eden function defines the farm scheme: 

farm :: (Transmissible a, Transmissible b) => 
        Int -> (Int->[a]->[[a]]) -> ([[b]]->[b]) -> 
        Process [a] [b] -> [a] -> [b] 
farm np distribute combine proc tasks 
  = combine (parMap proc (distribute np tasks)) 

This definition is very general as it takes the number of processes and functions 
for distributing the tasks and combining the results as parameters. The process 
creation is done using parMap. A round robin task distribution is defined by the 
function unshuffle which, on n PEs, assigns every n-th task (starting with offset 
i) to the i-th process (see Fig. 2). Assuming regular task granularities or at least a 
random distribution of task granularities, and assuming that the number of tasks 
is much larger than the number of processes, every process gets a representative 
set of tasks and an unequal work distribution is avoided. Note that unshuffle 
is incremental in the sense that it supplies every process with at least a part 
of its input as soon as possible, and its definition allows for parallel access 
to the sublists. The function shuffle in Fig. 2 is a counterpart of the function 
unshuffle which collects the results and restores the original order. It also works 
incrementally. When the order of the list elements need not be maintained, one 
can apply faster merging functions or even Eden’s nondeterministic fair merge 
process. 



unshuffle :: Int -> [a] -> [[a]]
unshuffle n xs = [ takeEach n (drop i xs) | i <- [0..n-1] ]
  where takeEach :: Int -> [a] -> [a]
        takeEach n []     = []
        takeEach n (x:xs) = x : takeEach n (drop (n-1) xs)

shuffle :: [[a]] -> [a]
shuffle []  = []
shuffle xxs = (map head xxs) ++ shuffle (map tail xxs)



Fig. 2. Distribution and Composition Functions 
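The round-robin distribution of Fig. 2 can be tried out in plain Haskell without any Eden machinery. The following is a minimal sketch: unshuffle is taken over from Fig. 2, whereas shuffleTotal is a stand-in introduced here (it is not the paper's shuffle). It reads the sublists column by column via transpose and therefore also accepts sublists of different lengths, which arise whenever the number of tasks is not a multiple of n.

```haskell
import Data.List (transpose)

-- Round-robin distribution as in Fig. 2: sublist i holds the elements at
-- positions i, i+n, i+2n, ... of the input list.
unshuffle :: Int -> [a] -> [[a]]
unshuffle n xs = [ takeEach n (drop i xs) | i <- [0 .. n - 1] ]
  where
    takeEach :: Int -> [a] -> [a]
    takeEach _ []       = []
    takeEach k (y : ys) = y : takeEach k (drop (k - 1) ys)

-- Stand-in for shuffle (an assumption of this sketch, not the definition
-- of Fig. 2): read the sublists column by column; transpose simply skips
-- sublists that are already exhausted.
shuffleTotal :: [[a]] -> [a]
shuffleTotal = concat . transpose
```

Because both functions are lazy, a prefix of a sublist can be inspected even for an infinite task list, which mirrors the incremental behaviour described in the text.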



Another implementation skeleton for map can now be defined using the farm
scheme and the round robin task distribution defined by unshuffle. The farm
scheme is applied to the number noPE of processor elements available, where noPE
is a constant provided by the Eden runtime system. The process placement in
the Eden system will place one process on every processor element.

map_farm :: (Transmissible a, Transmissible b) =>
            (a -> b) -> [a] -> [b]
map_farm = (farm noPE unshuffle shuffle) . (mkProc . map)

There are two differences from the simple parallelisation shown in Section 2. Firstly,
the total number of processes is fixed and set to the number noPE of processor
elements. By bundling tasks, this scheme increases the granularity and creates fewer
processes than the parMap approach, saving process creation and communication
costs. Secondly, farm can save memory compared to the simple parallelisation.
If there are data structures which are used by all tasks, then parMap will have
multiple copies of the data per PE, while in the farm approach there will only
be a single copy on each PE.
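The structure of the farm scheme can be studied sequentially by letting the ordinary map stand in for parMap and a plain list function stand in for a process. The sketch below does this under stated assumptions: farmSeq, blocks and mapFarmBlocks are names introduced here, the block-wise distribution is only one possible choice for the distribute parameter, and the fixed value 4 replaces the runtime constant noPE.

```haskell
-- Sequential model of the farm scheme: map replaces Eden's parMap and a
-- function of type [a] -> [b] replaces a process abstraction.
farmSeq :: Int -> (Int -> [a] -> [[a]]) -> ([[b]] -> [b])
        -> ([a] -> [b]) -> [a] -> [b]
farmSeq np distribute combine worker tasks =
  combine (map worker (distribute np tasks))

-- Hypothetical block-wise distribution: np contiguous blocks of nearly
-- equal length. Its matching combine function is concat.
blocks :: Int -> [a] -> [[a]]
blocks np xs = go np xs
  where
    go 0 _  = []
    go k ys =
      let size        = (length ys + k - 1) `div` k
          (blk, rest) = splitAt size ys
      in  blk : go (k - 1) rest

-- map expressed through the sequential farm model with 4 "processes".
mapFarmBlocks :: (a -> b) -> [a] -> [b]
mapFarmBlocks f = farmSeq 4 blocks concat (map f)
```

Unlike unshuffle, blocks needs the length of the whole task list before it can emit the first block, so it is not incremental. This illustrates why the choice of the distribute parameter matters for a lazy, stream-based setting.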

In the following we discuss the possibility of replacing the communication of 
tasks by the recomputation of the task list in each process. We call this scheme 
direct mapping, because the original computation is simply divided into as many 
parts as there are PEs and the parts are directly mapped onto the PEs. 



3.2 Static Task Distribution: Direct Mapping 

Like farm, the direct mapping scheme creates one process per PE each working 
on a subpart of the task list. But each process lazily evaluates the whole task 
list and selects its own subpart of it. This local task list evaluation and selection 
saves a considerable amount of communication. The price one has to pay is the 
partial recomputation of the task list by each process. But especially on systems 
with high communication costs, this putative overhead pays off. Replacing com- 
munication by recomputation is a well-known parallel programming technique. 
The direct mapping scheme makes this technique available to parallel functional
programmers in an easy-to-handle way.

The trick is to pass the task list to the processes not as input but as a 
parameter. Each process can then compute the part of the task list determined 
by its unique process identifier pid and the total number of processes np: 

dm :: (Transmissible b) =>
      Int -> (Int->[a]->[[a]]) -> ([[b]]->[b]) ->
      ([a]->Process () [b]) -> [a] -> [b]
dm np distribute combine proc tasks
  = combine [ proc (extract pid np tasks) # ()
            | pid <- [0..np-1] ] `using` spine
  where extract i np ts = (distribute np ts) !! i

A corresponding direct mapping implementation skeleton for map which uses the
functions unshuffle and shuffle for distributing tasks and combining results
is defined by:

map_dm :: (Transmissible b) =>
          (a -> b) -> [a] -> [b]
map_dm = (dm noPE unshuffle shuffle) . (rfi . map)

rfi :: (Transmissible b) =>
       (a -> b) -> a -> Process () b
rfi f x = process () -> (f x)

Footnote (on the multiple copies of shared data under parMap, Section 3.1):
Sometimes this copying can be avoided by declaring the data at top level. In this
special case the compiler, as an extension of the Glasgow Haskell Compiler (GHC),
will keep only one copy of the data in each PE.



Similar to mkProc, the function rfi (remote function invocation) maps a function
to a process abstraction. The function argument is, however, not communicated
as in mkProc but passed as a parameter to the process abstraction. As in the
farm scheme, the tasks have to be of equal complexity to avoid a load imbalance.
Tasks of different complexity can be dealt with by employing a workpool, which
is described next.
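The selection step of direct mapping can be modelled sequentially as well. In the sketch below every "process" receives the complete task list as an ordinary parameter and picks its share by its pid. The names dmSeq, roundRobin and mapDmSeq, the index-based formulation of the round-robin selection and the fixed process count 3 are assumptions made for illustration. Note that in this sequential model the distributed list is shared between the workers, whereas in Eden each process would evaluate it on its own PE.

```haskell
import Data.List (transpose)

-- Sequential model of direct mapping: each worker is handed the whole
-- task list as a parameter and selects its own part by its pid.
dmSeq :: Int -> (Int -> [a] -> [[a]]) -> ([[b]] -> [b])
      -> ([a] -> [b]) -> [a] -> [b]
dmSeq np distribute combine worker tasks =
  combine [ worker (extract pid) | pid <- [0 .. np - 1] ]
  where
    extract i = distribute np tasks !! i

-- Round-robin selection written with explicit indices; it yields the same
-- sublists as a round-robin unshuffle.
roundRobin :: Int -> [a] -> [[a]]
roundRobin n xs =
  [ [ x | (k, x) <- zip [0 ..] xs, k `mod` n == i ] | i <- [0 .. n - 1] ]

-- map expressed through the sequential direct mapping model.
mapDmSeq :: (a -> b) -> [a] -> [b]
mapDmSeq f = dmSeq 3 roundRobin (concat . transpose) (map f)
```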

3.3 Dynamic Task Distribution: Workpool 

For so-called irregular parallelism, i.e. tasks of different granularity, we need an
approach different from the ones presented before, which were static in the sense
that every process was working on a predefined subset of tasks. To cope with
extremely variable task granularities one usually employs a workpool with a set
of tasks which are dynamically assigned to free worker processes. In our version,
every PE hosts one worker process, which iteratively 1) receives a task from the
workpool, 2) computes the result and 3) sends back the result. The results are
interpreted as new requests for work and composed to produce the overall result.
The flow of data together with their types is shown in Fig. 3.




Fig. 3. The workpool scheme 



The Eden program which implements the workpool is shown in Fig. 4. The tasks
are spread across the worker processes using the distribute function defined in
Fig. 5. The distribution depends on which workers have delivered the result of
their previous computation. Initial requests initialReqs are generated depending
on the prefetch argument to prevent workers from being idle while waiting
to receive a new task. For the programs measured to date, a prefetch argument
of 2 has been sufficient. More elaborate versions of this workpool scheme with
dynamic task creation and termination detection have been presented in [Ill-

type Pid = Int -- Process Ids
type Tid = Int -- Task Ids

workpool :: (Transmissible a, Transmissible b) =>
            Int -> Int -> (Pid -> Process [(Tid,a)] [(Pid,(Tid,b))])
            -> [a] -> [b]
workpool np prefetch proc tasks
  = sortMerge (map (map (\ (pid,res) -> res)) fromWorkers)
  where fromWorkers   = [ proc pid # ts | (pid,ts) <- toWorkers ]
        toWorkers     = zip [0..np-1] taskQueues
        taskQueues    = distribute np requests numberedtasks
        numberedtasks = zip [0..] tasks
        requests      = initialReqs ++ newReqs
        initialReqs   = concat (replicate prefetch [0..np-1])
        newReqs       = merge # (map (map (\ (pid,res) -> pid))
                                     fromWorkers)

Fig. 4. Workpool

Using the workpool process scheme of Fig. 4 one can define the following
implementation skeleton for map:

map_wp :: (Transmissible a, Transmissible b) =>
          (a -> b) -> [a] -> [b]
map_wp = (workpool noPE 2) . workerProc

workerProc :: (a -> b) -> Int -> Process [(Int,a)] [(Int,(Int,b))]
workerProc f pid
  = process numberedTasks -> map solveTask numberedTasks
  where solveTask (nr,t) = (pid, (nr, f t))

Note that the workpool scheme is not only worthwhile for irregular problems but 
also when working on non-uniform systems with processors of different speed. For 
our measurements we used however only uniform systems with equally equipped 
processors. 
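The two pure building blocks of the workpool, the request-driven distribution and the order-restoring merge of Fig. 5, can be exercised in plain Haskell on a fixed request list. The sketch below takes distribute and sm over from Fig. 5. The index-based unshuffle and the fold-based sortMergeSketch are assumptions of this sketch rather than the paper's definitions. sortMergeSketch relies on every worker returning its results in ascending order of task ids, which holds because tasks are handed out in list order.

```haskell
-- Round-robin fallback, used once the request list is exhausted.
unshuffle :: Int -> [a] -> [[a]]
unshuffle n xs =
  [ [ x | (k, x) <- zip [0 ..] xs, k `mod` n == i ] | i <- [0 .. n - 1] ]

-- As in Fig. 5: the next task goes to the worker named by the next request.
distribute :: Int -> [Int] -> [a] -> [[a]]
distribute np _        []       = replicate np []
distribute np []       ts       = unshuffle np ts
distribute np (r : rs) (t : ts) =
  take r tqs ++ ((t : (tqs !! r)) : drop (r + 1) tqs)
  where
    tqs = distribute np rs ts

-- Merge two lists that are both ascending with respect to p.
sm :: (a -> a -> Bool) -> [a] -> [a] -> [a]
sm _ [] ys = ys
sm _ xs [] = xs
sm p (x : xs) (y : ys)
  | p x y     = x : sm p xs (y : ys)
  | otherwise = y : sm p (x : xs) ys

-- Assumed variant of sortMerge: fold the per-worker result lists into one
-- list ordered by task id, then drop the ids.
sortMergeSketch :: [[(Int, b)]] -> [b]
sortMergeSketch = map snd . foldr (sm (\x y -> fst x < fst y)) []
```

With three workers and the request sequence 0, 1, 2, 0, 0, 1, the six tasks a to f end up in the queues "ade", "bf" and "c": worker 0 asked for work three times and therefore received three tasks.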



3.4 Increasing Task Granularity 

distribute :: Int -> [Int] -> [a] -> [[a]]
distribute np rs     []     = replicate np []
distribute np []     ts     = unshuffle np ts
distribute np (r:rs) (t:ts) = (take r tqs) ++
                              ((t : (tqs !! r)) : (drop (r+1) tqs))
  where tqs = distribute np rs ts

sortMerge :: [[(Int,a)]] -> [a]
sortMerge xss = sm (\ x y -> (fst x) < (fst y)) (sortMerge odds)
                                                 (sortMerge evens)
  where [odds,evens] = unshuffle 2 xss

sm :: (a->a->Bool) -> [a] -> [a] -> [a]
sm p []     ys     = ys
sm p xs     []     = xs
sm p (x:xs) (y:ys) = if (p x y) then (x : sm p xs (y:ys))
                                else (y : sm p (x:xs) ys)

Fig. 5. Auxiliary functions for the workpool

A well-known technique to increase the granularity in fine-grained problems is
to put several tasks together to form chunks of tasks or macro-tasks. Working
with coarse-grained macro-tasks instead of fine-grained tasks reduces the
communication overhead substantially and thus improves the runtime behaviour on
high-latency distributed systems. The following function embeds implementation
skeletons into ones with increased task granularity.

macro :: Int -> (Int -> [a] -> [[a]]) -> ([[b]] -> [b])
         -> (([a] -> [b]) -> [[a]] -> [[b]])
         -> (a -> b) -> [a] -> [b]
macro size decompose compose mapscheme f xs
  = compose (mapscheme (map f) (decompose size xs))

The parameter function decompose is meant to divide the input list into subparts
of a given size. The parameter function compose should be the counterpart of
decompose, i.e. the equation compose (decompose size xs) == xs should be valid
for finite lists xs. The parMap scheme in particular can profit from macro-tasks,
as the number of processes is reduced to the number of macro-tasks:

map_par_macro = macro size chunk concat map_par
  where chunk :: Int -> [a] -> [[a]]
        chunk k [] = []
        chunk k xs = (take k xs) : chunk k (drop k xs)

For an input list of length k this scheme will produce $\lceil k/\mathit{size} \rceil$ processes,
where size should be defined globally. Using unshuffle for chunking (although
unshuffle will not produce chunks of the given size but size chunks) and
shuffle for combining will make the macro-task version of parMap identical to
the farm scheme:

macro noPE unshuffle shuffle map_par = map_farm.
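The effect of macro on the number of tasks can be checked sequentially. In the following sketch macro and chunk are as given above, while the ordinary map stands in for map_par and the chunk size 3 is an arbitrary illustrative value. mapMacroSeq is a name introduced here, and chunk is assumed to be called with a positive size only.

```haskell
-- The embedding function from the text.
macro :: Int -> (Int -> [a] -> [[a]]) -> ([[b]] -> [b])
      -> (([a] -> [b]) -> [[a]] -> [[b]])
      -> (a -> b) -> [a] -> [b]
macro size decompose compose mapscheme f xs =
  compose (mapscheme (map f) (decompose size xs))

-- Split a list into chunks of k elements (the last chunk may be shorter).
-- Assumes k > 0.
chunk :: Int -> [a] -> [[a]]
chunk _ [] = []
chunk k xs = take k xs : chunk k (drop k xs)

-- Sequential stand-in for map_par_macro: map replaces map_par.
mapMacroSeq :: (a -> b) -> [a] -> [b]
mapMacroSeq = macro 3 chunk concat map
```

For a list of ten elements and chunk size 3, chunk yields four macro-tasks, in line with the rounded-up quotient stated in the text, and concat restores the original list.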

4 Experimental Results 

Eden has been implemented by extending the Glasgow Haskell Compiler, version
3.02. The Eden runtime system consists of a set of multi-threaded abstract
machines, each of which is mapped to a separate processor element (PE) on a
multi-computer system. The system can be used with the message passing
libraries MPI (Message Passing Interface) [20] or PVM (Parallel Virtual Machine)
[8]. A special Eden prelude module has been constructed which implements much
of Eden's functionality in Eden itself.

In the following we will explore the implementation skeletons in the context
of several examples on a network of workstations (NOW) and discuss their
efficiency. Afterwards we show a few additional measurements on a Beowulf cluster
located at the University of St. Andrews. The two systems, which both
represent low-cost high-performance parallel computing platforms, have the following
characteristics:



System   CPU (MHz)             OS            Mem    #Nodes  Ethernet  Latency
NOW      UltraSparc IIi (300)  Solaris 2.6   128MB  6       10MBit    282 µs
Beowulf  Pentium II (450)      Linux RH 6.2  384MB  64      100MBit   142 µs



The latency on both systems is relatively high, favouring coarser-grained parallel
programs. All experiments (including determination of the latencies) were done
under PVM 3.4.2. To rule out speedup gains due to the usage of more memory in
the parallel setting (fewer garbage collections compared to the sequential version),
we used only 40 MB on each processor node. This keeps the time for garbage
collection under 5% of the sequential runtime.

The examples have been compiled with full optimization (flag -O2). We show
the runtimes of different parallelisations and give the absolute speedups, i.e. we
compare parallel runtimes with the runtime of the corresponding purely sequential
programs containing no parallel machinery. The runtimes are given as the
average over five runs to rule out erroneous measurements due to temporary network
traffic caused by other services. We compare the effectiveness of the
implementation skeletons for tasks with equal or varying granularity. A more complete
presentation of the experiments described in this section will be contained in the
forthcoming thesis m-



4.1 Regular Tasks: A Simple Ray Tracer 

Given a scene consisting of 3D objects and a camera, a ray tracer calculates 
a 2D image of the scene. For every pixel of the output image, the ray tracer 
shoots a ray into the scene and tests whether it impacts with any object of the 
scene. In this simple version object properties like reflection and transparency 
are neglected. Therefore a pixel gets the color of the object its corresponding 
ray hits first; if it hits none at all, it gets the color of the background. The scene 
itself is maintained as an unordered list of objects, so that the effort to compute 
the nearest impact with the objects is nearly the same for all rays. The main 
function of the ray tracer is top: 

top :: ScreenSize -> CPos -> [Objects] -> [Impact]
top detail cameraPos scene
  = map (firstimpact scene) allRays
  where allRays = generateRays detail cameraPos

Footnote (Beowulf cluster): EPSRC Research Grant No. GR/M 32351

Ulrike Klusik et al.

This simple algorithm contains fine-grained parallelism with tasks of the same 
time complexity and depends only on medium-sized globally shared data (the 
scene). The top level parallelism is inherent in the call of the function map which
can be parallelised using the four implementation skeletons introduced in the 
previous sections. 

Results. For the runtime measurements we used a screen of 400 x 400 pixels 
with a scene consisting of 17 objects, so the shared data was relatively small. 
From the runtimes and absolute speedups (shown in Fig. 6) we see that the
direct mapping scheme performs best. The reason is that for this version only 
half of the communication is needed, as the rays for each process are given in a 
compact format. 




[Plot: absolute speedups against the number of PEs]

runtimes (in seconds)

# PEs   parmap   farm    dm      workpool
1       65.36    64.58   64.02   67.96
2       42.13    41.26   35.10   43.20
3       33.76    32.64   24.64   30.56
4       30.28    28.65   19.76   25.08
5       27.76    25.88   16.98   25.80
6       26.38    25.20   15.08   26.40

sequential runtime: 54.78s



Fig. 6. Absolute speedups and runtimes of ray tracer on the NOW 
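Since the speedup plot itself is not reproduced here, the absolute speedups can be recomputed from the runtime table as the sequential runtime divided by the parallel runtime. The snippet below is a small worked calculation for the direct mapping column of Fig. 6. The function and list names are introduced here for illustration only.

```haskell
-- Absolute speedup: runtime of the purely sequential program divided by
-- the runtime of the parallel program.
absoluteSpeedup :: Double -> Double -> Double
absoluteSpeedup tSeq tPar = tSeq / tPar

-- Direct mapping runtimes from Fig. 6 (NOW, 1 to 6 PEs), in seconds.
dmRuntimes :: [Double]
dmRuntimes = [64.02, 35.10, 24.64, 19.76, 16.98, 15.08]

-- Speedups relative to the sequential runtime of 54.78 seconds.
dmSpeedups :: [Double]
dmSpeedups = map (absoluteSpeedup 54.78) dmRuntimes
```

For direct mapping this gives a speedup of roughly 3.6 on six PEs, and a value below 1 on a single PE, which reflects the overhead of the parallel machinery compared to the sequential program.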



All the other schemes take nearly the same time. parMap performs worst, as 
for each task we have to copy the scene within the process abstraction. The 
difference is not big, as the scene is still small. The workpool (with prefetch 
2) on one PE is slower than the other schemes due to additional computations 
caused by the sequence numbers added to the tasks to be able to reconstruct 
them in the correct order. Moreover, more work for the task distribution is done 
in this scheme. 

4.2 Irregular Tasks: An Exact Linear System Solver 

The algorithm LinSolv discussed in this section finds an exact solution of a linear
system of equations of the form $Ax = b$ where $A \in \mathbb{Z}^{n \times n}$ and $b \in \mathbb{Z}^{n}$. LinSolv
uses a multiple homomorphic images approach which consists of the following
three stages: map the input data into several homomorphic images; compute the
solution in each of these images; and combine the results of all images to a result
in the original domain ('lifting'). The original LinSolv program was written by
Hans-Wolfgang Loidl [I3-

The advantage of this approach stems from the fact that calculations over arbitrary
precision integers are quite expensive, whereas the main computation can be performed
more cheaply in the smaller domains $\mathbb{Z}_p$ ($\mathbb{Z}$ modulo $p$), where it is possible to work
with the standard integers. Each $\mathbb{Z}_p$ is determined by a different prime number
$p$. The computations on these domains can be performed in parallel and
be expressed in a map-like fashion. Cramer's Rule was chosen as the basic
linear system solver as it computes the exact solution and provides again a high
potential for parallelism. The solution is computed by

$x_{p,j} = \det A_p^{j} / \det A_p, \quad j = 1, \ldots, n$

where the matrix $A_p^{j}$ is $A_p$ ($A$ modulo $p$) with the j-th column replaced by vector
$b_p$ ($b$ modulo $p$). Each of the determinants can be computed independently. In
the algorithm we have to take care that we do not take prime numbers for
which $\det A_p$ is zero. We need to filter out these unlucky prime numbers after
computing the determinant in the homomorphic image. This results in tasks
with two different granularities: small ones, only computing $\det A_p$, and big
ones, computing n further determinants.
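The computation inside one homomorphic image can be illustrated by a small, self-contained sketch of Cramer's Rule modulo a prime. This is not the LinSolv code: the names detMod, powMod and solveMod, the Laplace expansion used for the determinant and the modular inverse via Fermat's little theorem are all choices made here for brevity. An unlucky prime shows up as Nothing, which corresponds to the small tasks mentioned above that compute only the determinant of $A_p$.

```haskell
type Matrix = [[Integer]]

-- Determinant modulo p by Laplace expansion along the first row
-- (exponential cost, so adequate only for small illustrative matrices).
detMod :: Integer -> Matrix -> Integer
detMod p []         = 1 `mod` p
detMod p (row : rs) =
  sum [ sign j * a * detMod p (map (dropAt j) rs) | (j, a) <- zip [0 ..] row ]
    `mod` p
  where
    sign j     = if even j then 1 else -1
    dropAt j r = take j r ++ drop (j + 1) r

-- Modular exponentiation by repeated squaring.
powMod :: Integer -> Integer -> Integer -> Integer
powMod _ 0 p = 1 `mod` p
powMod b e p
  | even e    = half * half `mod` p
  | otherwise = b * powMod b (e - 1) p `mod` p
  where
    half = powMod b (e `div` 2) p

-- Cramer's Rule in the image modulo a prime p. Nothing signals an unlucky
-- prime, i.e. the determinant of the matrix is zero modulo p.
solveMod :: Integer -> Matrix -> [Integer] -> Maybe [Integer]
solveMod p a b
  | d == 0    = Nothing
  | otherwise = Just [ detMod p (replaceCol j) * dInv `mod` p
                     | j <- [0 .. length b - 1] ]
  where
    d            = detMod p a
    dInv         = powMod d (p - 2) p   -- inverse by Fermat's little theorem
    replaceCol j = [ take j r ++ [bi] ++ drop (j + 1) r | (r, bi) <- zip a b ]
```

For the system with rows (2, 1) and (1, 3) and right-hand side (3, 5), the determinant is 5, so 5 is an unlucky prime, while modulo 7 the solution is (5, 0).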

Results. The measurements were taken for linear systems of various sizes, with 
elements not greater than 2000. As an example we show the results for a linear 
system with dimension 8. For this system we get only few tasks as 21 lucky 
primes are sufficient to find the exact solution. If there are more PEs than tasks 
it would be necessary to also parallelise the computation of the homomorphic 
solutions, e.g. by computing the determinants in parallel. The runtimes and
speedups are given in Fig. 7.




[Plot: absolute speedups against the number of PEs]

runtimes (in seconds)

# PEs   farm     dm       workpool
1       240.96   123.84   115.04
2       113.54   114.60   61.70
3       84.04    110.58   40.00
4       52.30    106.12   33.48
5       51.18    101.48   25.72
6       43.96    101.92   24.86

sequential runtime: 93.66s



Fig. 7. Absolute speedups and runtimes of LinSolv on the NOW



Although there are only few tasks, the workpool version delivers the best
results. There are two reasons for this: first, the load can be balanced better;
second, and probably more significantly, not too many primes are generated
without real demand. The runtimes of the farm on one processor are unusually
high, because the main process produces prime numbers without demand, and
this thread gets the same amount of time as the local process solving the
equations and the thread computing the final lifting phase. The problem of the direct
mapping version is the recomputation of the prime list in each worker process.

4.3 Compensating Irregularity: Mandelbrot Sets

Mandelbrot sets can be used to generate fractal graphics. For each pixel of the 
screen, it is necessary to compute its color (in fact, its distance to the Mandel- 
brot set), which depends on the coordinates of the pixel. This computation is 
performed for each pixel by iterating a function until a good approximation is 
obtained, or until a maximum number of iterations has been reached. For some 
pixels it is sufficient to compute a few iterations, while for others the maximum 
number of iterations is needed. Therefore the problem’s inherent pixel-wise par- 
allelism is of irregular granularity. 

mandel :: Int -> Int -> [Int] -- Using a (sizeX x sizeY) screen
mandel sizeX sizeY = map calcPix (range (0,0) (sizeX-1,sizeY-1))

range :: (Int,Int) -> (Int,Int) -> [(Int,Int)]
range (x1,y1) (x2,y2) = [(x,y) | x <- [x1..x2], y <- [y1..y2]]

calcPix :: (Int,Int) -> Int
calcPix (x,y) = loop (firstApprox (x,y)) 0

The much too fine granularity has to be coarsened by grouping pixels into larger 
tasks. Time-consuming pixels may be clustered with this approach, which bears 
the risk of a serious load imbalance. 
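The source of the irregularity can be made concrete with the common escape-time formulation of the Mandelbrot iteration. The sketch below is an assumption-laden illustration, not the paper's loop and firstApprox functions (which are not shown and which compute a distance estimate): escapeTime counts how many iterates stay within radius 2, and rowTasks groups the pixels row by row into coarser tasks. The mapping of pixel coordinates to the rectangle from -2.5 to 1 by -1 to 1 is likewise a hypothetical choice.

```haskell
import Data.Complex (Complex (..), imagPart, realPart)

-- Number of iterates of z -> z*z + c (starting from 0) that stay inside
-- the circle of radius 2, capped at maxIter.
escapeTime :: Int -> Complex Double -> Int
escapeTime maxIter c =
  length (takeWhile inside (take maxIter (iterate (\z -> z * z + c) 0)))
  where
    inside z = realPart z ^ (2 :: Int) + imagPart z ^ (2 :: Int) <= 4

-- Row-wise task list: one task per screen row, each a list of points.
rowTasks :: Int -> Int -> [[Complex Double]]
rowTasks sizeX sizeY =
  [ [ toPoint x y | x <- [0 .. sizeX - 1] ] | y <- [0 .. sizeY - 1] ]
  where
    toPoint x y = (scale x sizeX * 3.5 - 2.5) :+ (scale y sizeY * 2 - 1)
    scale k n   = fromIntegral k / fromIntegral n
```

A point such as the origin needs the full iteration budget, whereas a point far outside the set is rejected after a single step. This per-pixel difference is exactly what makes the pixel-wise tasks irregular.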

Results. We have generated Mandelbrot sets for a 1024 x 1024 screen, using
1000 as maximum number of iterations. Fig. 8 shows the results on the NOW.




[Plot: absolute speedups against the number of PEs]

runtimes (in seconds)

# PEs   parmap   farm     workpool
1       235.00   231.10   227.92
2       131.94   123.74   121.92
3       93.42    87.13    83.44
4       75.44    69.20    64.70
5       69.22    58.54    54.83
6       58.80    51.82    46.88

sequential runtime: 215.06s



Fig. 8. Absolute speedups and runtimes of Mandelbrot on the NOW 



As expected, even when using a row-wise partitioning the workpool scheme is
best suited for the original inherently irregular problem. But parMap and farm
can also produce good speedups if we use a different partitioning. When rows are
combined with unshuffle these schemes can be nearly as good as workpool. The
reason is that the round robin distribution with unshuffle often compensates for
the irregularity of the rows. There is a big difference in the computation needed
for different rows, but neighbouring rows usually need similar times. Hence,
unshuffling the tasks leads to bigger tasks of a more regular granularity, giving
workpool only a small advantage over the other schemes.

4.4 Measurements on a Beowulf Cluster

We had the opportunity to repeat some of our measurements on the St. Andrews
Beowulf cluster. As can be seen from the absolute speedups achieved on up to
ten PEs for the ray tracer with the same input parameters as before (see Fig. 9)
and for the LinSolv example with the 8x8 input matrix on up to 25 PEs (see
Fig. 10), the implementation skeletons show the same relative performance in
comparison to each other. The direct mapping scheme performs best for the
regular tasks of the ray tracer, while the workpool scheme copes best with the
irregularity within the LinSolv program.




[Plots: absolute speedups against the number of PEs]

Fig. 9. Absolute speedups of the simple ray tracer on the Beowulf cluster

Fig. 10. Absolute speedups of LinSolv with 8x8 input matrix on the Beowulf cluster



4.5 Summary 

In this section we have considered examples with inherent data parallelism. The 
problems can be distinguished by the number of tasks and their granularity. We 
had two problems with fine-grained parallelism: the ray tracer and the Mandel- 
brot sets. LinSolv is coarse-grained as there are only a few expensive tasks. To 
observe the impact of load balancing we had two problems with varying task
granularity: in Mandelbrot between 0 and 1000 iterations are needed per pixel,
whereas in LinSolv there are two different complexities. The complete
characterization is shown in Table 1.

Table 1. Characterization of the problems 



problem      granularity   decomposition   best scheme
mandelbrot   fine          irregular       workpool
raytracer    fine          regular         direct mapping
linsolv      coarse        irregular       workpool



The workpool is the best choice if the task granularity is varying, although the
maintenance of the workpool is not free. If all tasks have the same granularity
the direct mapping yields the best results, because we can save communications
by implicitly passing the task description as a parameter. As a consequence of
increasing the task granularity by the use of macro-tasks in parMap, the parMap
and farm schemes show a very similar behaviour. The results obtained on the
Beowulf cluster show that the direct mapping and the workpool scheme behave
equally well on large numbers of processor elements.

4.6 Further Measurements on the Beowulf Cluster 

Additional measurements on the St. Andrews Beowulf cluster with up to 61
processor nodes have shown that Eden achieves a good scalability for complex
problems. We have studied a more complex ray tracer for spheres and the LinSolv
program explained before. The spheres ray tracer differs from the ray tracer
described in Subsection 4.1 in that it is restricted to spherical objects, but takes
into account reflection and light conditions. The original version of this algorithm
has been ported from the Impala suite of implicitly parallel programs.

Spheres Ray Tracer Results. Fig. 11 shows the speedups obtained for the ray
tracer program with a direct mapping parallelisation for screens with dimension
350 (sequential runtime: 213.42s) and 500 pixels (sequential runtime: 435.01s).
The largest speedup is 27 on 37 PEs for the larger screen size. The speedup
increases almost linearly until about 40 processor elements. Then a dramatic
decrease is observed which is due to an increase of the startup time of the Eden
system on the Beowulf cluster. The startup problems on more than 40 processors
have not been solved yet. Nevertheless these results are very promising for what
is a relatively immature implementation.





Fig. 11. Absolute speedups of spheres ray tracer on the Beowulf cluster

Fig. 12. Absolute speedups of LinSolv with 14 x 14 input matrix on the Beowulf cluster

LinSolv Results. Fig. 12 shows the speedups obtained for the LinSolv algorithm
described in Subsection 4.2. The measurements used a sparse matrix with
dimension 14 (sequential runtime: 552.59s) and a workpool parallelisation.
Measurements were taken for up to 61 PEs and show good performance: the largest
speedup is 35 on 61 PEs.

Footnote (Impala suite): URL: http://www.csg.lcs.mit.edu/impala/

5 Related Work

In the last decade a great variety of different approaches to the combination of 
functional programming and parallelism has evolved. A comprehensive collec- 
tion of introductory chapters and surveys of current research projects has been 
published by Hammond and Michaelson [TO] . Besides many other approaches to 
efficient parallel functional programming the concept of algorithmic skeletons jS] 
has received a lot of attention. For efficiency, skeletons are often implemented in 
a low-level language different from the language in which they are used.
Therefore skeleton implementations tend to be a "black box" for the programmer,
which can make them hard to use in an efficient way. Well-known approaches
to introduce skeletons in a parallel language include: Darlington et al. j^, P3L
m, Skil pp, and others. Like Eden, Skil allows new skeletons to be designed in the
language itself, the main difference being that Skil is based on the imperative host
language C m- More closely related to our work are the following approaches:

In GAPML [S] Michaelson, Hamdam et al. extend an ML compiler by ma- 
chinery which automatically searches the given program for higher-order func- 
tions which are suitable for parallelisation. During compilation these are replaced 
by efficient low-level implementations written in C and MPI. In HaskSkel [m. 
Hammond and Rebon Portillo combine the evaluation strategies of GpH [21] 
with Okasaki’s Edison library [TB] (which provides efficient implementations of 
data structures) to implement parallel skeletons directly in GpH. 

6 Conclusion 

We have introduced four implementation skeletons for the higher-order function 
map which are based on the schemes parMap, farm, direct mapping, and workpool. 
Depending on problem characteristics like degree and granularity of inherent 
parallelism, static or dynamic evolution of parallelism etc. the programmer can 
choose the appropriate skeleton to obtain an efficient parallel program. The 
schemes have proven to be helpful during program construction and efficient 
during program execution. 

This is supported by runtime results measured with the Eden system on two
different high-latency parallel machines. The current implementation contains
the basic functionality without optimisations, but it is already mature enough
to deliver reasonable speedups with several examples on a conventional network
of workstations and on a Beowulf cluster.

Abstract. Curry combines the concepts of functional, logic and concurrent
programming languages. Concurrent programming with ports allows
the modeling of objects in Curry similar to object-oriented programming
languages. In this paper, we present ObjectCurry, a conservative extension
of Curry. ObjectCurry allows the direct definition of templates which
play the role of classes in conventional object-oriented languages. Objects
are instances of a template. An object owns a state and reacts when it 
receives a message — usually by sending messages to other objects or a 
transformation of its state. ObjectCurry also provides inheritance be- 
tween templates. Furthermore, we show how programs can be translated 
from ObjectCurry into Curry by exploiting the concurrency and distri- 
bution features of Curry. To implement inheritance, we extend the type 
system of Curry, which is based on parametric polymorphism, to include 
subtyping for objects and messages. 



1 Introduction 

Curry m is a multi-paradigm declarative language which integrates functional,
logic, and concurrent programming paradigms (see [3] for a survey on integrated
functional logic languages). The syntax of Curry is similar to Haskell [TS], e.g.,
functions are defined by rules of the form “$f\ t_1 \ldots t_n = e$” where $f$ is the function
to be defined, $t_1, \ldots, t_n$ are the pattern arguments, and $e$ is an expression which
replaces a function call matching the left-hand side. In addition to Haskell, local
names introduced in let and where clauses can be declared as “free”, which
means that their value is unknown. Such free or logical variables in expressions
support logic programming features like partial data structures and search for
solutions. Furthermore, functions in Curry can be defined by conditional
equations “$l$ | $c$ = $r$” where the condition $c$ is a constraint (an expression of the
predefined type Success) which must be solved in order to apply the equation.
Basic constraints are “success” (the always satisfiable constraint) and
equational constraints of the form “$e_1$ =:= $e_2$”, which are satisfied if both sides $e_1$
and $e_2$ are reducible to the same value (data term). More complex constraints
can be constructed with the concurrent conjunction operator &. A non-primitive
constraint like “$c_1$ & $c_2$” is solved by solving both constraints $c_1$ and $c_2$
concurrently. Finally, “$c_1$ &> $c_2$” denotes the sequential conjunction of two constraints,
i.e., first the constraint $c_1$ is solved and, if this was successful, the constraint $c_2$
is evaluated.

* This research has been partially supported by the German Research Council (DFG)
under grant Ha 2457/1-2 and by the DAAD under the PROCOPE programme.

M. Mohnen and P. Koopman (Eds.): IFL 2000, LNCS 2011, pp. 89-106, 2001.
(c) Springer-Verlag Berlin Heidelberg 2001

Michael Hanus, Frank Huch, and Philipp Niederau

Using both functional and logic features of Curry, it is possible to model
objects with states (see Section 2) at a very low level. Therefore, we propose
an extension of Curry, called ObjectCurry, which provides all standard features
of object-oriented programming, like (concurrent and distributed) objects with
state that can be defined by class templates and inheritance between templates.

This paper is structured as follows. In the next section, we review the modeling
of concurrent objects in Curry as proposed in [5]. We present ObjectCurry in
the subsequent section and show the translation of ObjectCurry programs into
Curry in Sect. 4. Section 5 describes an extended type system for ObjectCurry
in order to detect type errors related to inheritance at compile time before we
discuss related work in Sect. 6 and conclude in Sect. 7.

2 Implementing Objects in Curry 

It is well known from concurrent logic programming [16] that objects can be
easily implemented as predicates processing a stream of incoming messages. The 
internal state of the object can be implemented as a parameter which may change 
in recursive calls when the message stream is processed. Since constraints play 
the role of predicates in Curry, we consider objects as functions with result type 
Success. These functions take the current state of the object and a stream 
of incoming messages as arguments. If the stream is not empty, the “object” 
function calls itself recursively with a new state, depending on the first element 
of the message stream. Thus, 

o :: State -> [MessageType] -> Success

is the general type of an object where State is the type of the internal state of 
the object and MessageType is the type of messages. Usually, we define a new 
algebraic data type for the messages. 

The following example shows a counter which understands the messages Inc, 
Set s, and Get v. Thus, we define the data type 

data CounterMessage = Inc | Set Int | Get Int 

The counter has an integer value as an internal state. Receiving Inc increments 
the internal state and Set s assigns it to a new value s. To get the current state 
of the counter as an answer, we send the message Get v to the object where v 
is a free logical variable. In this case the counter object binds this variable to its 
current state: 

counter :: Int -> [CounterMessage] -> Success 

The evaluation of the constraint “counter 42 s” creates a new counter object 
with initial value 42. Messages are sent by instantiating the variable s. The 
object terminates if the stream of incoming messages is empty. In this case 
the constraint is reduced to the trivial constraint success. For instance, the 
constraint 

let s free in counter 41 s & s=:=[Inc, Get x] 

is successfully evaluated where x is bound to the value 42. The annotation 

counter eval rigid 

marks counter as a rigid function. This means that an expression “counter x s” 
can be reduced only if s is bound.^1 
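
The stream-processing view of an object can be illustrated with a minimal Python sketch. It is not the Curry definition of counter: the class LogicVar is a hypothetical stand-in for a free logical variable, the message classes are invented for this illustration, and the Boolean result plays the role of Success.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


class LogicVar:
    """Hypothetical stand-in for a free logical variable (bindable once)."""

    def __init__(self) -> None:
        self.value: Optional[int] = None

    def bind(self, value: int) -> bool:
        # Binding succeeds if the variable is free or already holds this value.
        if self.value is None:
            self.value = value
            return True
        return self.value == value


@dataclass
class Inc:
    pass


@dataclass
class Set:
    s: int


@dataclass
class Get:
    v: LogicVar


CounterMessage = Union[Inc, Set, Get]


def counter(state: int, messages: List[CounterMessage]) -> bool:
    """Consume the message stream; True plays the role of `success`."""
    if not messages:                      # empty stream: the object terminates
        return True
    first, rest = messages[0], messages[1:]
    if isinstance(first, Inc):
        return counter(state + 1, rest)   # recursive call with the new state
    if isinstance(first, Set):
        return counter(first.s, rest)
    # Get v: bind the variable to the current state, then continue
    return first.v.bind(state) and counter(state, rest)


x = LogicVar()
ok = counter(41, [Inc(), Get(x)])
```

After the call, x.value holds 42, which corresponds to the binding of x in the constraint shown above. Suspension on an unbound stream (the effect of the rigid annotation) is not modeled here.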

If there is more than one process sending messages to the same counter object, 
it is necessary to merge the message streams from different processes into a single 
message stream. Doing that with a merger function causes a set of problems as 
discussed in [?]. Therefore, Janson et al. [?] proposed the use of ports for the 
concurrent logic language AKL, which are generalized in [?] to support distributed 
programming in Curry. In principle, a port is a constraint between a multiset 
and a stream which is satisfied if the multiset and the stream contain the same 
elements. In Curry a port is created by a constraint “openPort p s” where p 
and s are free logical variables. This constraint creates a multiset and a stream 
and combines them over a port. Elements can be inserted into the multiset by 
sending them to p. When a message is sent to p, it will automatically be added 
to the stream s in order to satisfy the port constraint. For sending a message, 
there is a constraint “send m p” where m is the message and p is a port created 
by openPort. 

Using ports, we can rewrite the counter example as follows: 

openPort p s &> counter 0 s & (send Inc p &> send (Get x) p) 
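
To make the role of a port more concrete, the following Python sketch lets two senders share one port whose messages arrive at a single consumer. It is only an analogy under stated assumptions: the class Port, the string messages "Inc" and "Stop", and the use of threads with a FIFO queue are all choices made for this illustration and are not part of Curry's port constraint.

```python
import queue
import threading
from typing import List


class Port:
    """Hypothetical port: many senders insert messages, one consumer reads a stream."""

    def __init__(self) -> None:
        self._stream: "queue.Queue[str]" = queue.Queue()

    def send(self, message: str) -> None:
        self._stream.put(message)

    def receive(self) -> str:
        return self._stream.get()      # blocks until a message is available


def run_counter(initial: int, port: Port, result: List[int]) -> None:
    """Counter object: reads messages from the port until it sees Stop."""
    state = initial
    while True:
        message = port.receive()
        if message == "Inc":
            state += 1
        elif message == "Stop":
            result.append(state)
            return


def producer(port: Port, count: int) -> None:
    for _ in range(count):
        port.send("Inc")


port = Port()
result: List[int] = []
consumer = threading.Thread(target=run_counter, args=(0, port, result))
consumer.start()

senders = [threading.Thread(target=producer, args=(port, 100)) for _ in range(2)]
for sender in senders:
    sender.start()
for sender in senders:
    sender.join()

port.send("Stop")                      # sent only after both senders have finished
consumer.join()
final_value = result[0]
```

No explicit merger is needed: both senders hold the same port, and the interleaving of their messages is decided when the messages are inserted.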



^1 In contrast to rigid functions, Curry also provides flexible functions which nondeterministically 
instantiate their arguments in order to allow the reduction of function 
calls, which provides for goal solving as in logic programming. By default (which 
can be changed by eval annotations), constraints are flexible and all other functions 
are rigid. 

3 ObjectCurry, an Object-Oriented Extension of Curry 

Using the technique presented above is troublesome and error-prone, in particular 
if the state consists of many variables, because the programmer has to 
repeat the whole state in the recursive calls. This motivated us to introduce 
some special syntax for defining templates. Templates play the role of classes 
in conventional object-oriented programming languages. We use the word “tem- 
plate” instead of class to avoid confusion between classes in an object-oriented 
meaning and Haskell’s type classes. For instance, a template for counter objects 
can be defined in ObjectCurry as follows: 

template Counter = 
  constructor 
    counter init = x := init 
  methods 
    Inc   = x := x + 1 
    Set s = x := s 
    Get v = v =:= x 

A template definition starts with the reserved keyword template followed by 
the name of the template. Similar to a data type declaration, the name of the 
template is its own type. The constructor is a function which we use to instan- 
tiate new objects. The left-hand side is constructed as in conventional function 
declarations. The right-hand side is a set of assignments describing the attributes 
of the object and their initial values. The assignments are consecutively written 
using the offside rule. 

The messages which are understood by the object and the reactions to these 
messages are defined by methods. Messages are defined similarly to the constructor. 
The left-hand side of a method declaration consists of the name of 
the method followed by a list of patterns as in a function declaration and describes 
the signature of a message with the same name as the method. The 
right-hand side describes the behavior of the object in response to receiving a 
message. A reaction can be a transformation of the internal state of the object. 
The transformation of a state can be expressed by a set A of assignments of the 
form “v := e”. If the tuple (v1, ..., vn) is the current state of the object where 
the template has the attributes v1, ..., vn, A specifies the state transformation 
(v1, ..., vn) ↦ (v1', ..., vn') defined by 

\[ v_i' = \begin{cases} e_i & \text{if } v_i := e_i \in A \\ v_i & \text{otherwise} \end{cases} \]

Additionally, the right-hand side of a method can also include constraints, i.e., 
expressions of the type Success, because constraints offer further possibilities 
to express reactions, e.g., equational constraints are used to yield an answer by 
binding a logical variable, or messages are sent to other objects by the send 
constraint. 

The assignments and constraints in the right-hand side of a method are 
treated as a set (where for each component of the state at most one assignment 
is allowed), i.e., they can be placed in any order: an assignment has no side effect 
to another assignment in the same method. 
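
The set semantics of assignments can be sketched in Python as follows. This is an illustration only: the function apply_assignments and the dictionary representation of states and assignments are assumptions of this sketch. Every right-hand side is evaluated in the old state, so the order of the assignments is irrelevant, and a dictionary naturally allows at most one assignment per attribute.

```python
from typing import Callable, Dict

State = Dict[str, int]
Assignments = Dict[str, Callable[[State], int]]


def apply_assignments(state: State, assignments: Assignments) -> State:
    """Compute the next state; each right-hand side sees the OLD state."""
    unknown = set(assignments) - set(state)
    if unknown:
        raise KeyError(f"assignment to undeclared attribute(s): {sorted(unknown)}")
    return {
        name: assignments[name](state) if name in assignments else value
        for name, value in state.items()
    }


# Inc from the Counter template: x := x + 1
after_inc = apply_assignments({"x": 41}, {"x": lambda s: s["x"] + 1})

# Two assignments that read each other's attribute: both use the old values.
swapped = apply_assignments(
    {"x": 1, "y": 2},
    {"x": lambda s: s["y"], "y": lambda s: s["x"] + s["y"]},
)
```

In the second call, x becomes the old y and y becomes the sum of the old x and y, independent of the order in which the two assignments are listed.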

A template definition introduces the type of the template, the constructor 
function and the messages at the top level of the Curry program. If T is the type 
of the template and the constructor function has n arguments of types τ1, ..., τn, the 
type of the constructor function is 

τ1 -> ... -> τn -> Constructor T 

In a similar manner, a method has the type 

τ1 -> ... -> τn -> Message T 

if it takes n arguments. Additionally, each object understands the predefined 
message Stop which terminates the object. 

To instantiate a template, there is a constraint 

new :: Constructor α -> Object α -> Success 

new takes a constructor function and a free logical variable and binds the variable 
to a new instance of the template α. Messages can be sent to such an object 
using the constraint send :: Message α -> Object α -> Success. For 
instance, the evaluation of the following expression binds the variable v to the 
value 42: 

new (counter 41) o 
  & (send Inc o &> send (Get v) o &> send Stop o) 

To give an object the possibility to send a message to itself, there is a predefined 
identifier self, self is visible in the right-hand side of each method and bound 
to the current object. Note that sending a message to self has no immediate 
side effect to the attributes of the object because the objects can only react to 
this message after the evaluation of the current method is finished. 

As a true extension to the modeling of objects in Curry as described in Sect. 2, 
ObjectCurry also provides inheritance. A template can inherit attributes and 
methods from another template, which we call parent, where inherited methods 
can be redefined or new attributes and methods can be added. A supertemplate of 
a template T is T or one of its ancestors w.r.t. the parent relation. Subtemplates 
are analogously defined. 

For instance, we define a new template MaxCounter which inherits the attribute 
x and the methods Inc, Set, and Get from Counter. It also introduces a 
new attribute max which represents an upper bound for incrementing the counter. 
The method Inc will be redefined to avoid incrementing x to a value greater than 
max. Additionally, we define a new method SetMax v to set the upper bound: 

template MaxCounter extends Counter = 
  constructor 
    maxCounter init maxInit = counter init 
                              max := maxInit 
  methods 
    Inc           = x := (if x < max then x+1 else x) 
    SetMax newMax = max := newMax 
                    x := (if x<max then x else max) 

The reserved keyword extends followed by the name of the parent specifies that 
the template inherits the attributes and methods from Counter. 

The first expression in the right-hand side of the constructor of a subtemplate 
must be the function call of the constructor of the parent. In this way the initial 
values of the inherited attributes are determined. 

Methods can be redefined by defining a method with the same name in the 
subtemplate. All methods which are not redefined will be inherited. 



4 Translating ObjectCurry into Curry 

To translate ObjectCurry programs into Curry, we basically use the technique 
presented in Sect. 2. An abstract data type Msg contains data constructors for 
each message defined in all templates and the additional message Stop. We 
decided to use only one data type for all messages to obtain a maximum of 
flexibility. Of course, ObjectCurry programs translated in this way are not type 
safe in the sense that messages can be sent to objects which cannot understand 
these messages. We will discuss this issue and propose a solution for it in 
Sect. 5. 

For our counter example, we generate one data type for all messages: 

data Msg = Inc | Set Int | Get Int | SetMax Int | Stop 

Next we define a function which defines the initial state of a new object. If the 
state of the object consists of more than one attribute, the state is implemented 
as a tuple. 

counterInitState init = init 

The initialization function of a subtemplate uses the initialization function of its 
parent to obtain the initial values for the inherited attributes: 

maxCounterInitState (init,maxInit) = 
  let r_x = counterInitState init 
  in (r_x,maxInit) 

Given a state and a message, the following action function computes the next 
state defined by the corresponding method. 

counterAction x self Inc     = State (x+1) 
counterAction x self (Set s) = State s 
counterAction x self (Get v) | v =:= x = State x 
counterAction x self Stop    = Final 

We use the abstract data type “data State a = State a | Final” to distinguish 
normal states and the final state. 

In a subtemplate, redefined and new methods are similarly translated: 

maxCounterAction (x,max) self Inc 
  = State (if x < max then x+1 else x, max) 
maxCounterAction (x,max) self (SetMax newMax) 
  = State (if x < max then x else max, newMax) 

The action function of a subtemplate also contains an equation for each inherited 
method. Such an equation calls the action function of the parent of the template 
to obtain the next state: 

maxCounterAction (x,max) self (Get v) 
  = let State r_x = counterAction x self (Get v) 
    in State (r_x,max) 
maxCounterAction (x,max) self Stop = Final 

To create a new object, we use the constructor function and the new constraint. 
The constructor function determines the initial state of the object using the 
translated function for the initialization defined above and transfers the initial 
state and the action function of the object to a generic function loop which 
handles the recursive calls until the final state is reached: 

counter init self = 
  loop (counterInitState init) counterAction self 

For each template the same function loop is used which is defined by: 

loop eval rigid 

loop state action self (m:ms) = continuation nextState self ms 
where 

nextState = action state self m 

continuation (State ns) self ms = loop ns action self ms 
continuation Final _ _ = success 
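
The interplay of action functions and the generic loop can be sketched in Python. This is a hedged illustration rather than the generated Curry code: messages are represented as tuples such as ("Inc",) or ("Get", cell), a Python list cell stands in for the logical variable of Get, and returning False for a stream that ends without Stop is a choice of this sketch (the Curry loop has no equation for the empty stream).

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple, Union


@dataclass
class State:
    value: Any


class Final:
    """Marks the final state: no further messages are processed."""


Message = Tuple[Any, ...]
Action = Callable[[Any, Any, Message], Union[State, Final]]


def counter_action(x: int, self_ref: Any, msg: Message) -> Union[State, Final]:
    tag = msg[0]
    if tag == "Inc":
        return State(x + 1)
    if tag == "Set":
        return State(msg[1])
    if tag == "Get":
        msg[1].append(x)              # stand-in for binding the logical variable
        return State(x)
    if tag == "Stop":
        return Final()
    raise ValueError(f"Counter does not understand {tag}")


def max_counter_action(state: Tuple[int, int], self_ref: Any,
                       msg: Message) -> Union[State, Final]:
    x, max_ = state
    tag = msg[0]
    if tag == "Inc":                  # redefined method
        return State((x + 1 if x < max_ else x, max_))
    if tag == "SetMax":               # new method
        return State((x if x < max_ else max_, msg[1]))
    # Inherited methods: delegate to the parent, then re-attach the own attribute.
    result = counter_action(x, self_ref, msg)
    if isinstance(result, Final):
        return result
    return State((result.value, max_))


def loop(state: Any, action: Action, self_ref: Any, messages: List[Message]) -> bool:
    """Apply the action to each message until the final state is reached."""
    for msg in messages:
        next_state = action(state, self_ref, msg)
        if isinstance(next_state, Final):
            return True               # corresponds to `success`
        state = next_state.value
    return False                      # stream ended without Stop (sketch-specific)


cell: List[int] = []
ok = loop((0, 1), max_counter_action, None,
          [("Inc",), ("Inc",), ("Get", cell), ("Stop",)])
```

With the initial state (0, 1), the second Inc has no effect because x has reached max, so the Get message observes the value 1.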

The function new has a constructor function and a free logical variable as ar- 
guments. It creates a port to which the logical variable is bound and passes 
a stream associated with the port to the constructor function. Additionally, it 
passes the port to the constructor as the value for the identifier self: 

new constructor port = 
  let stream free in 
    openPort port stream &> constructor port stream 

In the transformation, each message has the type Msg. Objects are represented 
by ports, so an object has the type Port Msg instead of Object Template. 

We have implemented a compiler for ObjectCurry which translates a program 
from ObjectCurry to Curry following the ideas sketched in this section. The 
compiler is written in Curry itself. 



5 Type Safeness 

The presented translation into Curry programs is not type safe in the sense 
that messages can be sent to objects which cannot understand these messages. 
To detect this kind of type error without restricting the use of objects and 
messages, it is necessary to define a new type system and implement a new type 
checker which supports subtyping. 

5.1 Subtyping 

We introduce a new type system which uses subtype constraints for expressing 
the types of objects, messages and functions which have such argument types or 
deliver objects or messages as their results. 

First we take a look at the type of constructor functions, objects, messages 
and the predefined functions send and new. In a first step, we define three new 
predefined type constructors named Constructor, Object and Message with 
arity one. An object as an instance of a template T has type Object T. A 
message has type τ1 -> ... -> τn -> Message T, where τ1, ..., τn are the types of 
the arguments of this message and T is the template which defines this message. 
A constructor of a template T has type τ1 -> ... -> τn -> Constructor T, 
where again τ1, ..., τn are the types of the arguments of this constructor. For 
example, an instance of the template Counter has type Object Counter, the 
message Get has type Int -> Message Counter and the constructor function 
counter has type Int -> Constructor Counter. With these types the function 
send must have the type 

send :: Message α -> Object α -> Success 

and new has the type 

new :: Constructor α -> Object α -> Success 

These types do not allow subtyping w.r.t. a Hindley/Milner-like type system 
[?] as used in Curry. Therefore, we need subtyping in three cases in order to 
support object-oriented programming techniques and to combine them with the 
advantages of parametric polymorphism: 

1. We want to send messages defined in a template T to instances of subtem- 
plates of T. 

2. It should be possible to keep objects of different templates in a polymorphic 
data structure, e.g., in a list: If these objects have a common supertemplate, 
there are common messages which all of these objects understand. 

3. We also want to store messages defined in different templates in a polymor- 
phic data structure if these templates have a common subtemplate. 

Therefore, we introduce subtype constraints and constrained types. We use them 
to define new types of objects and messages which support subtyping in the 
three described cases. Note that, in contrast to other approaches to subtyping 
or order-sorted types, we consider only subtype relations between templates and 
not subtyping of standard data types, like numbers or functions, since this is 
sufficient for our purposes. 
Definition 1. A subtype constraint is an expression τ1 < τ2 where τi (i = 1,2) 
is a type variable or the name of a template. 

Definition 2. A constrained type is a pair τ|C consisting of a type expression 
τ and a set C of subtype constraints. A constrained type scheme has the form 
∀α1 ... αn.τ|C. 

Intuitively, a constraint of the form τ1 < τ2 expresses that τ1 must be a subtemplate 
of τ2. To allow keeping instances of different templates in one polymorphic 
data structure, an object gets the type Object α | {T < α}. For example, 
an instance of Counter gets the type Object α | {Counter < α} and an instance 
of MaxCounter gets the type Object α | {MaxCounter < α}. We can keep 
both objects in a list where this list has the type [Object α] | {Counter < 
α, MaxCounter < α}. The type of the list is inferred by using standard typing 
rules but additionally collecting all subtype constraints in one set. 

Intuitively, this constraint set can be satisfied because there exists a template 
T which is a supertemplate of Counter and a supertemplate of MaxCounter: 
Counter is a supertemplate of both Counter and MaxCounter. If we mix objects 
which do not have a common supertemplate, the constraint set cannot be 
satisfied. This makes sense because these objects do not have a common message 
and so there is no reason to store them in one data structure. We will formally 
define the satisfiability of a constraint set later. 

Using this type for an object, we must also modify the type of new as follows: 

new :: Constructor α -> Object β -> Success | {α < β} 

A similar modification of the type of a message allows messages of different 
types to be mixed in a common data structure: A message gets the type 

τ1 -> ... -> τn -> Message α | {α < T} 

where τ1, ..., τn are the types of the arguments of this message. 

With these definitions it is possible to send a message defined in a template T 
to an instance of a subtemplate of T: The resulting constraint set can be satisfied 
iff the object understands the message. For instance, if we send the message Inc 
to an instance of MaxCounter, we get the typed expression 

send Inc maxCounterObject :: Success | {α < Counter, MaxCounter < α} 

Unfortunately, we must also modify the type of send. Consider the following 
example: 

f m1 m2 o1 o2 = send m1 o1 & send m2 o2 & send m1 o2 

f has two messages and two objects as arguments. It sends the first message 
to the first object, the second message to the second object, and also the first 
message to the second object. With the type of send defined above, we get the 
type 



f :: Message α -> Message α -> Object α -> Object α -> Success 
For our running example, we assume: 

Inc              :: Message α | {α < Counter} 
(SetMax 42)      :: Message α | {α < MaxCounter} 
counterObject    :: Object α | {Counter < α} 
maxCounterObject :: Object α | {MaxCounter < α} 

Thus, the application of f to these arguments would yield the type 

f Inc (SetMax 42) counterObject maxCounterObject :: 
  Success | {α < Counter, α < MaxCounter, Counter < α, MaxCounter < α} 

The set of constraints of this type is not satisfiable because there is no substitution 
for α such that all constraints are elements of the inheritance hierarchy. This 
does not match our intuition because it is possible to send Inc to counterObject 
and maxCounterObject and (SetMax 42) to maxCounterObject. 

The problem can be easily solved if we modify the type of send: 

send :: Message α -> Object β -> Success | {β < α} 

This type corresponds to the intuition that a message defined in template α can 
be sent to all instances of template β provided that β is a subtemplate of α. 
Now the type of f is 

Message α -> Message β -> Object γ -> Object δ -> Success 
  | {γ < α, δ < β, δ < α} 

and “f Inc (SetMax 42) counterObject maxCounterObject” has type 

Success | {γ < α, δ < β, δ < α, α < Counter, β < MaxCounter, 
           Counter < γ, MaxCounter < δ} 

These subtype constraints are satisfiable by the following substitution σ: 

σ(α) = Counter, σ(β) = MaxCounter, σ(γ) = Counter, σ(δ) = MaxCounter 

5.2 Core ObjectCurry 

In order to define the type system of ObjectCurry, we introduce a simplified 
core language to provide a more compact representation of ObjectCurry’s typing 
rules. The expressions and templates of the core language are defined in Fig. 1. 
An expression E of the core language is either a variable, a lambda abstraction, 
an application of two expressions, an expression combined with the 
declaration of free variables, or a conditional expression. A template T consists 
of an initial assignment I, which defines the attributes and initial values of the 
template, and a set of methods. A template can also be defined as a subtemplate 
of another template by an extends clause. In addition to the 
initial assignments of the subtemplate, I' contains a call to the constructor function of its 
supertemplate. This ensures that each inherited attribute gets an initial value. 

Fig. 1. A core language for ObjectCurry 



A block of assignments A consists of assignments of the form x := E where 
E is any expression. Due to the fact that a constructor function of ObjectCurry 
can have some arguments, we allow lambda abstraction on initial assignments. 

A method M is defined by an expression E and a block of assignments A. E 
has to be a constraint (a function with the result type Success) which has to be 
solved when the method is called. The assignments define the transformation of 
the current state of the object. 

A program of Core ObjectCurry is a set of definitions of functions and tem- 
plates. The definition of a function has the form functionName = E (where E is 
usually a lambda abstraction) and the definition of a template is written as 

(constrName, methodName1, ..., methodNamen) = T. 



Such a program contains no local definitions, i.e., all identifiers are introduced 
on top level (thus, local declarations in ObjectCurry programs are globalized in 
Core ObjectCurry by lambda lifting). 

As an example, our original Counter and MaxCounter template definitions 
are transformed into the core language as follows: 



(counter, Inc, Set, Get) = Template Counter 
    λi . x := i                     (body of counter) 
    success ⇒ x := x+1              (body of Inc) 
    λs . success ⇒ x := s           (body of Set) 
    λv . (v =:= x) ⇒ ε              (body of Get) 

(maxCounter, Inc, SetMax) = Template MaxCounter extends Counter 
    λi . λmi . counter i, max := mi 
    success ⇒ x := if x<max then x+1 else x 
    λv . success ⇒ max := v, 
                   x := if x<max then x else max 



5.3 A Type System for ObjectCurry 

Before we present a type system for this core language, we define the satisfiability 
of a set of constraints. 

Definition 3. A (type) substitution σ is a mapping from type variables to types 
such that σ(α) ≠ α only for finitely many type variables α. We write a substitution 
as follows: σ = [x1/τ1, ..., xn/τn] if σ(xi) = τi for all i = 1, ..., n and 
σ(y) = y for all y ∉ {x1, ..., xn}. The extension of a substitution to types and 
constraint sets is obvious. 

In the following we assume that P is a Core ObjectCurry program. 

Definition 4. Let ℋ be the relation of subtemplates of P defined by its extends 
clauses. The reflexive and transitive closure of ℋ is denoted by ℋ*, also called 
inheritance hierarchy. 

Definition 5. A substitution σ satisfies a subtype constraint τ1 < τ2 w.r.t. the 
inheritance hierarchy ℋ*, denoted σ ⊨_{ℋ*} τ1 < τ2, if (στ1, στ2) ∈ ℋ*. 

A substitution σ satisfies a set C of subtype constraints (σ ⊨_{ℋ*} C) if for all 
c ∈ C: σ ⊨_{ℋ*} c. 

A set C of subtype constraints is satisfiable w.r.t. the inheritance hierarchy 
ℋ*, denoted ⊨_{ℋ*} C, if there is a substitution σ with σ ⊨_{ℋ*} C. 
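
Satisfiability of a constraint set can be illustrated with a small Python sketch. It is a brute-force search written for this illustration and is not the test procedure of the actual type inferencer. The function names closure and satisfying_substitution, the representation of a constraint as a pair (lower, upper), and the convention that every symbol that is not a template name is a type variable are assumptions of the sketch.

```python
from itertools import product
from typing import Dict, List, Optional, Set, Tuple

Constraint = Tuple[str, str]          # (lower, upper) stands for lower < upper


def closure(extends: Dict[str, str], templates: List[str]) -> Set[Tuple[str, str]]:
    """Reflexive and transitive closure of the parent relation."""
    pairs: Set[Tuple[str, str]] = set()
    for t in templates:
        current: Optional[str] = t
        # Walk up the chain of parents; the membership test stops on cycles.
        while current is not None and (t, current) not in pairs:
            pairs.add((t, current))
            current = extends.get(current)
    return pairs


def satisfying_substitution(constraints: List[Constraint], templates: List[str],
                            hierarchy: Set[Tuple[str, str]]) -> Optional[Dict[str, str]]:
    """Try every assignment of template names to the type variables."""
    variables = sorted({s for c in constraints for s in c if s not in templates})
    for choice in product(templates, repeat=len(variables)):
        sigma = dict(zip(variables, choice))
        if all((sigma.get(lo, lo), sigma.get(hi, hi)) in hierarchy
               for lo, hi in constraints):
            return sigma
    return None


templates = ["Counter", "MaxCounter"]
hierarchy = closure({"MaxCounter": "Counter"}, templates)

# Constraints obtained with the first type of send (one variable a for everything).
old = [("a", "Counter"), ("a", "MaxCounter"), ("Counter", "a"), ("MaxCounter", "a")]
# Constraints obtained with the modified type of send (a, b, g, d for α, β, γ, δ).
new = [("g", "a"), ("d", "b"), ("d", "a"), ("a", "Counter"),
       ("b", "MaxCounter"), ("Counter", "g"), ("MaxCounter", "d")]

old_result = satisfying_substitution(old, templates, hierarchy)
new_result = satisfying_substitution(new, templates, hierarchy)
```

For the running example of Sect. 5.1, the first constraint set has no solution, while the second is solved by mapping a and g to Counter and b and d to MaxCounter.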

Type environments collect the type information for named entities in a program: 



Definition 6. A type environment Γ is a mapping from names to constrained 
type schemes. In the following we denote by TE the set of all type environments. 

The union of two type environments Γ1 and Γ2 with non-overlapping domains 
is defined as follows: 

\[ (\Gamma_1 \uplus \Gamma_2)(a) = \begin{cases} \Gamma_2(a) & \text{if } \Gamma_1(a) \text{ is undefined} \\ \Gamma_1(a) & \text{if } \Gamma_2(a) \text{ is undefined} \end{cases} \]

Additionally, we define another concatenation of two type environments Γ1 and 
Γ2 which gives preference to Γ2 if an identifier is a member of the domains of 
both environments. We need this operation in order to extend the global type 
environment with the attributes of a template. 

\[ (\Gamma_1 \oplus \Gamma_2)(a) = \begin{cases} \Gamma_2(a) & \text{if } \Gamma_2(a) \text{ is defined} \\ \Gamma_1(a) & \text{otherwise} \end{cases} \]
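
The two ways of combining type environments can be sketched with Python dictionaries. This is only an illustration: the names disjoint_union and override, and the representation of type schemes as plain strings, are assumptions of the sketch.

```python
from typing import Dict

TypeEnv = Dict[str, str]      # names mapped to (textual) constrained type schemes


def disjoint_union(env1: TypeEnv, env2: TypeEnv) -> TypeEnv:
    """Union of two environments whose domains must not overlap."""
    overlap = env1.keys() & env2.keys()
    if overlap:
        raise ValueError(f"domains overlap on {sorted(overlap)}")
    return {**env1, **env2}


def override(env1: TypeEnv, env2: TypeEnv) -> TypeEnv:
    """Concatenation that prefers env2 for names defined in both environments."""
    return {**env1, **env2}


global_env = {"x": "Bool", "counter": "Int -> Constructor Counter"}
counter_attrs = {"x": "Int"}
max_counter_attrs = {"max": "Int"}

all_attrs = disjoint_union(counter_attrs, max_counter_attrs)
method_env = override(global_env, all_attrs)
```

In method_env the attribute x shadows the global identifier of the same name, which is the reason for giving preference to the second environment.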



Generic instances of constrained type schemes are defined as usual: 
Definition 7. A constrained type τ'|C' is a generic instance of a constrained 
type scheme ∀α1 ... αn.τ|C if there is a substitution σ with στ|σC = τ'|C' 
and σ(β) = β for all β ∉ {α1, ..., αn}. 

An attribute which is defined in a template T is also visible in the subtemplates 
of T with the same type. To specify the visibility of attributes in the methods 
of all subtemplates, we introduce attribute type environments: 

Definition 8. An attribute type environment Θ : Templates → TE maps the 
name of a template to a type environment. This type environment contains the 
types of the attributes defined in this template. 

Now we are able to define the well-typedness of Core ObjectCurry programs: 

Definition 9. A function definition f = λx1 ... λxn.e is well-typed w.r.t. a type 
environment Γ, an attribute type environment Θ, and an inheritance hierarchy 
ℋ*, if the following conditions are satisfied: 

- Γ(f) = ∀α1 ... αm.τ|C 

- Γ, Θ, ℋ* ⊢ λx1 ... λxn.e : τ|C can be deduced by the rules of Fig. 2 and 3 

- ⊨_{ℋ*} C 

A template definition (c, m1, ..., mn) = e is well-typed w.r.t. a type environment 
Γ, an attribute type environment Θ, and an inheritance hierarchy ℋ*, if 

- Γ(c) = τ0|C0, Γ(mi) = ∀αi.τi|Ci for i = 1, ..., n, 

- Γ, Θ, ℋ* ⊢ e : (τ0|C0, τ1|C1, ..., τn|Cn) can be deduced by the rules of Fig. 2 
and 3 

- ⊨_{ℋ*} C0 ∪ C1 ∪ ... ∪ Cn 

A Core ObjectCurry program is well-typed if there exist a type environment Γ, 
an attribute type environment Θ and an inheritance hierarchy ℋ* such that all 
function and template definitions are well-typed w.r.t. these environments and 

Γ(send) = ∀τ1, τ2. Message τ1 -> Object τ2 -> Success | {τ2 < τ1} 
Γ(new) = ∀τ1, τ2. Constructor τ1 -> Object τ2 -> Success | {τ1 < τ2} 

In the inference rules of Fig. 2 and 3 we use the auxiliary functions super and 
templates which yield all supertemplates of a template (including the template 
itself) and all templates of a program, respectively. 

In order to check the well-typedness of a program by the rules of Fig. 2 and 3, 
the type environment Γ must contain the types of each defined function and 
template. The attribute type environment Θ maps the name of each template 
to a new type environment which contains the types of the attributes defined 
in that template. The inheritance hierarchy consists of the subtype relations 
between all templates which are defined in the program. 

The inference rules [Axiom], [Abstraction], [Existential], and [Application] 
are defined in the usual way, compare the Curry Report [?]. The only modification 
is the collection of all constraints of all subexpressions into one set of 
constraints. The satisfiability of this constraint set is checked outside the typing 
rules in the definition of a well-typed program (see Def. 9). In the rule [Abstraction] 
we do not have to collect the constraints C of the type of the variable x: If 
E contains an occurrence of x, the constraints of the type of x are collected into 
the set of constraints of E by the other rules. Otherwise, x is never used and its 
constraints can be ignored. 

[Axiom] 
\[ \Gamma, \Theta, \mathcal{H}^* \vdash x : \tau|C \qquad \text{if } \tau|C \text{ is a generic instance of } \Gamma(x) \]

[Abstraction] 
\[ \frac{\Gamma[x/\tau|C], \Theta, \mathcal{H}^* \vdash E : \tau'|C'}{\Gamma, \Theta, \mathcal{H}^* \vdash \lambda x.E : \tau \to \tau'|C'} \]

[Existential] 
\[ \frac{\Gamma[x/\tau|C], \Theta, \mathcal{H}^* \vdash E : \tau'|C'}{\Gamma, \Theta, \mathcal{H}^* \vdash \mathtt{let}\ x\ \mathtt{free\ in}\ E : \tau'|C \cup C'} \]

[Application] 
\[ \frac{\Gamma, \Theta, \mathcal{H}^* \vdash E_1 : \tau_1 \to \tau_2|C_1 \qquad \Gamma, \Theta, \mathcal{H}^* \vdash E_2 : \tau_1|C_2}{\Gamma, \Theta, \mathcal{H}^* \vdash E_1\,E_2 : \tau_2|C_1 \cup C_2} \]

[Template] 
\[ \frac{\begin{array}{c} \mathit{name} \in \mathit{templates}(\mathcal{H}^*) \\ (\mathit{name}, x) \notin \mathcal{H}^* \text{ for all } x \in \mathit{templates}(\mathcal{H}^*) \text{ with } x \neq \mathit{name} \\ \Gamma' = \Gamma \oplus \Theta(\mathit{name}) \\ \Gamma', \Theta, \mathcal{H}^* \vdash_I^{\mathit{name}} I : \tau_0|C_0 \qquad \Gamma', \Theta, \mathcal{H}^* \vdash_M^{\mathit{name}} M_i : \tau_i|C_i \quad (i = 1, \ldots, m) \end{array}}{\Gamma, \Theta, \mathcal{H}^* \vdash \mathtt{Template}\ \mathit{name}\ I\ M_1 \ldots M_m : (\tau_0|C_0, \tau_1|C_1, \ldots, \tau_m|C_m)} \]

[Subtemplate] 
\[ \frac{\begin{array}{c} (\mathit{name}_1, \mathit{name}_2) \in \mathcal{H}^*, \quad (\mathit{name}_2, \mathit{name}_1) \notin \mathcal{H}^* \\ \Gamma' = \Gamma \oplus \biguplus_{p \in \mathit{super}(\mathcal{H}^*, \mathit{name}_1)} \Theta(p) \\ p_i \in \mathit{super}(\mathcal{H}^*, \mathit{name}_1) \quad (i = 1, \ldots, m) \\ \Gamma', \Theta, \mathcal{H}^* \vdash_{I'}^{\mathit{name}_1, \mathit{name}_2} I' : \tau_0|C_0 \qquad \Gamma', \Theta, \mathcal{H}^* \vdash_M^{p_i} M_i : \tau_i|C_i \quad (i = 1, \ldots, m) \end{array}}{\Gamma, \Theta, \mathcal{H}^* \vdash \mathtt{Template}\ \mathit{name}_1\ \mathtt{extends}\ \mathit{name}_2\ I'\ M_1 \ldots M_m : (\tau_0|C_0, \tau_1|C_1, \ldots, \tau_m|C_m)} \]

Fig. 2. Typing rules for ObjectCurry programs (1) 

In addition to Curry’s type system, we introduce new rules [Template] and 
[Subtemplate] for checking the types of templates and subtemplates. In the rule 
[Template], which is applicable if there is no true supertemplate in ℋ*, we extend 
the type environment Γ by the type assumptions for the attributes of the 
template in order to make the attribute types visible in the type checking of 
the methods. Note that the global type environment Γ contains the types of all 
identifiers defined in the program (including the method identifiers) so that we 
can use the methods of the template also inside the template and we do not need 
a special rule for recursion. 






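[Assignment1] 
\[ \frac{\Gamma, \Theta, \mathcal{H}^* \vdash x : \tau|C_1 \qquad \Gamma, \Theta, \mathcal{H}^* \vdash E : \tau|C_2 \qquad \Gamma, \Theta, \mathcal{H}^* \vdash_A A : C_A}{\Gamma, \Theta, \mathcal{H}^* \vdash_A x := E, A : C_1 \cup C_2 \cup C_A} \]

[Assignment2] 
\[ \Gamma, \Theta, \mathcal{H}^* \vdash_A \varepsilon : \emptyset \]

[Init] 
\[ \frac{\Gamma, \Theta, \mathcal{H}^* \vdash_A A : C}{\Gamma, \Theta, \mathcal{H}^* \vdash_I^{\mathit{name}} A : \mathtt{Constructor}\ \mathit{name}|C} \]

[Init'] 
\[ \frac{\Gamma, \Theta, \mathcal{H}^* \vdash E : \mathtt{Constructor}\ \mathit{name}_2|\emptyset \qquad \Gamma, \Theta, \mathcal{H}^* \vdash_A A : C}{\Gamma, \Theta, \mathcal{H}^* \vdash_{I'}^{\mathit{name}_1, \mathit{name}_2} E, A : \mathtt{Constructor}\ \mathit{name}_1|C} \]

[Method] 
\[ \frac{\Gamma, \Theta, \mathcal{H}^* \vdash E : \mathtt{Success}|C \qquad \Gamma, \Theta, \mathcal{H}^* \vdash_A A : C' \qquad v \text{ new type variable}}{\Gamma, \Theta, \mathcal{H}^* \vdash_M^{\mathit{name}} E \Rightarrow A : \mathtt{Message}\ v|\{v < \mathit{name}\} \cup C \cup C'} \]

[AbstractionX] 
\[ \frac{\Gamma[x/\tau|C], \Theta, \mathcal{H}^* \vdash_X X : \tau'|C'}{\Gamma, \Theta, \mathcal{H}^* \vdash_X \lambda x.X : \tau \to \tau'|C'} \qquad X \in \{I, I', M\} \]

Fig. 3. Typing rules for ObjectCurry programs (2) 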



The rule [Subtemplate] is similar to [Template] except for the following differences: 

- The type environment Γ' also contains the type assumptions of the inherited 
attributes, i.e., the attributes of the current template and all its supertemplates. 

- I' contains a call to the constructor function of the parent. It must be checked 
that this has the type Constructor name2 where name2 is the name of the 
parent. This is ensured by using ⊢_{I'} instead of ⊢_I. 

- Furthermore, we have to ensure that (name1, name2) is an element of the 
type hierarchy ℋ* and (name2, name1) must not be in ℋ*. Due to the 
fact that ℋ* is transitive and reflexive, it also contains (name2, name2), 
(name1, name1), and (name1, p) for all supertemplates p of name1. 

- For checking the types of the methods, we also allow that a method Mi 
is assigned to some supertemplate pi (note that pi is the current template 
name1 or one of its supertemplates). This is necessary if the method is 
redefined. Note, however, that methods redefined in subtemplates must have 
the same type as in supertemplates. This is reasonable since, due to the logic 
features of Curry, arguments of a method can be used as value parameters 
as well as result parameters so that a contra- or covariance restriction on 
arguments cannot be clearly required. 

[Template] and [Subtemplate] use the rules of Fig. 3 which we discuss next. The 
rule [Assignment1] ensures that in an assignment of the form x := E the type 
of x is the same as the type of the expression E. [Assignment2] handles the 
special case of an empty list of assignments. The rule [Init] checks the type of 
a constructor function where the name of the template must be provided as an 
extra argument. [Init'] additionally checks if E is a valid call of the constructor 
function of the parent. For this purpose, we also need the name of the parent 
(name2). The rule [Method] types a method with subtyping the result type as 
discussed in Sect. 5.1. It checks whether the expression E of a method E ⇒ A 
is a constraint (with the type Success) and collects the resulting constraints. 

Due to the fact that we need lambda abstraction over initial assignments I 
or I' and methods M, we introduce a generic rule [AbstractionX]. X can be I, 
I' or M. The rule is similar to the common rule for abstraction. 

5.4 Type Inference 

We have also developed a type inferencer for our modified type system. Due to 
lack of space we cannot present it here but refer to [T^ which contains the 
complete description of the type inferencer and its implementation. The algorithm is 
based on the algorithm T> of Kaes [TO]. However, our inference algorithm is sim- 
pler because we allow subtyping only for objects and messages. The algorithm 
unifies type expressions in the same way as standard type inference algorithms 
13 but additionally collects the subtype constraints. The resulting set of subtype 
constraints is then checked for satisfiability with a simple test procedure. 

Our implementation of the type checker for ObjectCurry is based on Mark 
Jones’ “Typing Haskell in Haskell” |2] which we adapted to Curry. The imple- 
mentation of the ObjectCurry compiler together with the type inferencer is freely 
available from the authors. 

6 Related Work 

In this section we compare ObjectCurry with some other approaches for the 
object-oriented extension of functional (logic) languages. 

Oz [17] is a concurrent constraint programming language with a particular 
syntax for object-oriented programming, thus offering similar features as 
ObjectCurry. The main differences between ObjectCurry and Oz are the type 
system and the operational semantics. Oz is untyped and supports no detec- 
tion of type errors at compile time in contrast to ObjectCurry. Furthermore, the 
operational model of ObjectCurry is based on Curry’s computation model jl] 
which combines an optimal lazy evaluation strategy [Tj for the functional (logic) 
parts of a program with the concurrent evaluation of constraints. In particular, 
we consider objects as functions consuming the stream of incoming messages 
where the state is passed as an argument between the different function calls. 
In contrast, Oz evaluates functions in an eager manner and implements stateful 
objects via a specific cell store. 

Haskell++ [7] extends Haskell's type classes to object classes. It provides 
a limited form of multiple inheritance and virtual methods but does not provide 
subtype polymorphism. For instance, it is not possible to create a list with 
elements of different instances of one object class. The main goal in the 
development of Haskell++ was a minimal extension to Haskell which supports the 
inheritance of functions. Objects in Haskell++ contain only methods but no 
states. On the other hand, ObjectCurry provides real objects with states in the 
sense of object-oriented programming. It combines the flexibility of conventional 
object-oriented languages with the features of functional logic programming. 

O’Haskell [14] provides an extension for full object-oriented programming 
with states and subtype polymorphism. It uses monads for the implementation of 
concurrent objects and states. The main advantage of our implementation, which 
uses the concurrent and logical features of Curry, is the opportunity to combine 
this with Curry’s port concept [S] for distributed programming. In contrast to 
O’Haskell, objects in ObjectCurry can also be executed in a distributed setting. 
This is supported by a function newNamedObject which is similar to new but 
makes the new object accessible from other machines in the network with a 
unique port identifier (see |S] for more details). The implementation of objects 
remains unchanged. Furthermore, the logical variables in Curry can be exploited 
as answer channels since the receiver of a message can bind the logical variables 
in the message to send answers back to the sender. 

Finally, Objective Caml im is an object-oriented extension of ML. Objective 
Caml inherits the strict evaluation strategy of ML and subtype polymorphism 
can only be programmed with explicit coercions in contrast to ObjectCurry 
which is lazy and provides subtype polymorphism without any annotations since 
all types can be automatically inferred. 



7 Conclusions 

We presented the language ObjectCurry as an extension of Curry to allow a 
convenient definition of objects via templates. Templates play the role of classes 
in conventional object-oriented languages. A template defines the attributes and 
methods of an object. Methods are used to determine the reactions to incoming 
messages where reactions can be the change of the object’s state or a constraint to 
send messages to other objects. Assignments are used to express a transformation 
on the local state of an object. Templates can also inherit attributes and methods 
from other templates and inherited methods can be redefined. 

We proposed a direct translation of templates into pure Curry but translated 
target programs using more than one template are not type safe in the sense 
of traditional typed object-oriented languages. Therefore, we developed a new 
type system which uses subtype constraints in the types of objects, messages 
and functions which use objects or messages. We implemented a compiler which 
translates ObjectCurry programs into Curry and a type checker which also infers 
types of expressions without explicit type annotations. 
Abstract. We present an extension of the lazy functional programming 
language Haskell for distributed programming. For the communication 
between processes we add a port concept. Ports behave like channels in 
Concurrent Haskell except that only the process which creates a port can 
read from it. Ports can also be sent through other ports. The receiver can 
then also write messages through the received port. This is independent 
of the location in a network. The programmer uses the same functions 
to write to local or remote ports. Communication between concurrent 
and distributed processes is programmed with the same functions. Concurrent 
processes can easily be distributed, for example to provide scalability 
of a system. In many distributed applications it is necessary that 
two independently started programs can connect at runtime. Therefore 
we provide the registration of ports. Other processes can look them up 
from anywhere in a network. 

The implementation consists of a library which provides functions for cre- 
ating new processes, communication between concurrent and distributed 
processes, and error handling with exceptions. 



1 Distributed Programming 

The development of software systems has changed in the last years. Many sys- 
tems are distributed, because of the following reasons: 

— Parallelization: Resources (e.g. speed or space) needed for an application 
are not sufficient on one computer. 

— Inherent distributed character: The application itself is distributed. Ex- 
amples are (mobile) telephones and a cash dispenser together with the bank 
server. 

— Reliability and fault tolerance: To increase the reliability of a system 
it is possible to arrange for several computers to co-operate such that the 
failure of one or more computers does not affect the system behavior as a 
whole. 

— Access to special resources: In a heterogeneous network, special re- 
sources, e.g. a scanner or printer can only be accessed from one computer. 

With the boom of networks and the Internet, the number of distributed applications 
increases. In particular, more and more applications have an inherently distributed 
character. To provide convenient programming, modern programming 
languages must support distributed programming. It is not sufficient to provide 
a library for communication via sockets. The language has to be extended 
with a high-level concept for distribution and communication between processes. 
We want to extend the functional programming language Haskell [J+98] with 
features for elegant distributed programming. 

The discussion of communication in distributed systems in Sect. 2 shows 
the advantages of a port concept, which is introduced in Sect. 3. We present 
the implementation in Sect. 4 and discuss related work in Sect. 5. Section 6 
concludes and discusses future work. 

2 Distributed Communication 

Several approaches have been made to extend functional languages 
for concurrent or distributed programming. The most successful one is Erlang 
[AWV93], which was developed by Ericsson and has been used for the development 
of many telecommunication applications. Erlang is an eager functional 
programming language, which is extended with special features for concurrent 
and distributed programming. New processes can be created with spawn on a 
local or a remote computer. Every process has a process identifier (pid), which is 
used for the communication between processes. Other processes can send mes- 
sages to this pid and messages can also contain pids to distribute pids in a 
network. If a message is sent to a process it is stored in a mailbox. The receiv- 
ing process can conveniently access this mailbox with pattern matching. It need 
not extract messages in their chronological order. Only relevant messages can 
be fetched with pattern matching. The others can be left in the mailbox and 
processed later. 

Another important point of Erlang is that communication between concurrent 
processes on the same computer does not differ from communication 
to remote processes in a network. The programmer uses the same programming 
techniques. Therefore a system developed in a concurrent setting can later be 
distributed easily. Scalability of the system is supported by the language. 

For fault tolerant programming Erlang also provides a linking mechanism. 
Processes of the system can automatically be informed if others die, for example 
because a computer crashes. Hence these processes can react to the failure and 
reorganize the system to a consistent state. 

But for all that, Erlang has a great disadvantage: it is untyped. Therefore it 
is more difficult to find program errors than in Haskell. Two approaches 
for typing Erlang have been made [MW97, AA98], but they only type sequential 
programming in Erlang. The communication stays untyped. But here, too, a type 
system is necessary. In Erlang, even typing mistakes (here we do not mean type 
errors; an example of a typing mistake is ‘Erlng’), like writing the atom 
lookup in the pattern of a receive statement and lookUp at the corresponding 
send statement, do not yield a compile time or runtime error, but a deadlock. 
Finding these typing mistakes is very difficult. 
We want to extend Haskell with a similar communication mechanism, but 
in consideration of Haskell's type system. This yields more safety in program 
development. Concurrent Haskell [JGF96] is state of the art for concurrent 
programming in Haskell. It provides functions to start and terminate threads 
inside an application and to synchronize them with mutable variables (MVar) in 
the IO monad. On top of these MVars it also provides semaphores and asynchronous 
channels for message passing. But there is no concept for distributed 
programming. Therefore we want to extend Haskell with a powerful mechanism 
for distributed programming. 

In our first approach we extended Haskell with an Erlang-style communication 
mechanism [Huc99]. The main problem in this approach is to type the 
communication. Different processes can understand different messages but others 
can also understand the same messages. A type system with subtypes would 
be needed. But it is difficult to integrate subtyping into Haskell's type system. 
Therefore we have implemented runtime type checking. But this is not a good 
match for Haskell's type system. 

We have now decided to extend the concepts of Concurrent Haskell with 
communication via channels and MVars for distributed programming. But this 
leads to implementation problems. In Concurrent Haskell many processes may 
synchronize on a mutable variable (MVar) or a channel. Distributing some of 
these processes in a network leads to synchronization problems, because it is not 
clear where an MVar is located. Consider the following situation: 




Fig. 1. Distribution of Concurrent Haskell Processes 



Two processes can write a mutable variable and two other processes want to 
read it. In a distributed setting these processes could be located on four different 
computers in a network. But where can the mutable variable be located? It has 
to be located on one computer in the net, because it needs a state for the 
storage of a value, if no reader suspends on it and no reader wants to read 
the value. The possibilities for the location of the MVar are one of the four 
computers of the example or an independent computer. But all these locations 
have disadvantages for providing fault tolerance. It is necessary that parts of the 
system may terminate or even crash without affecting the rest of the system. If 
the computer the mutable variable is located on crashes, then the whole system 
cannot work anymore, although there is still a writer and a reader, which could 
communicate with each other from the logical structure of the system. The 
readers and the other writers hang and it is difficult to repair the system to a 
consistent state, where the other components can communicate with each other 
again. 

Another problem is garbage collection. We do not know when a mutable 
variable or channel is garbage. Other processes in the distributed setting 
communicate with each other and the MVar (a reference to it) can also be distributed 
to the whole network. So it will be almost impossible to check if an MVar is 
still known somewhere in the net, and an algorithm would be very expensive and 
produce much communication in the network, especially if some parts of the 
system have already crashed. Systems like Glasgow Parallel Haskell [THM+96] 
or Glasgow Distributed Haskell [PTL00] implement distributed garbage collection, 
but they do not support the development of open systems, which is essential 
for the development of distributed applications. It is also not possible to implement 
fault tolerance in these systems, which is in our view a major requirement 
for a programming language for distributed systems. It would be interesting to 
investigate whether distributed garbage collection can be efficiently implemented 
under the conditions of programming open and fault tolerant systems. We think that 
these algorithms are too complex and need too much communication, i.e. traffic 
in the network. Therefore we restrict communication to only one reader for each 
channel and can avoid these problems. Our practical experience showed that 
this restriction is no disadvantage, as we will discuss later. 



3 Distributed Haskell 

In the discussion of distributing Concurrent Haskell, we have seen which problems 
appear with multiple readers and writers of an MVar. Our solution to this 
problem is a restriction to only one reader. With this restriction we can locate 
the MVar at the same place where the reader is located. If the reader terminates 
or crashes, the MVar terminates too. No other readers can suspend on it. Therefore 
no processes are hanging. On the other hand, there can still exist writers 
that want to write to the MVar. Hence they are also in an inconsistent state. 
But they can recognize this when they send a message to the MVar. A failed 
write operation can throw an exception. This can be caught and the writer can 
initiate a reorganization of the crashed components, for example on another 
computer. 

Erlang also has this restriction, because a mailbox is associated with a single 
process. Only this process can read from it. Transferring the Erlang model to 
Haskell leads to typing problems, because the mailbox of a process needs a fixed 
type in Haskell. But consider two clients which communicate with a database 
server, but handle different jobs in the rest of the system. Both will receive messages 
from the database, for example if they look up the value of a given key. 
Therefore their mailboxes must have the same type. But in other parts they 
communicate with different processes and exchange different data there. The 
result would be that all mailboxes of the system must have the same type. So 
type checking does not help to find errors. Another problem is that the system 
is not structured any more, because the messages a process may receive cannot 
be represented as one data type. But these messages are its communication 
interface and they have to be visible. 

Another solution would be subtyping for the messages. We have implemented 
this in [Huc99] by runtime type checking. But this concept is not a good match for 
Haskell's type system. Our way out of this problem is to allow multiple mailboxes 
for every process, like in Concurrent Haskell. We call these mailboxes ports and 
every port of a process may have a different type. But we restrict these ports to 
one reader. This reader is the process which creates the port. 

3.1 The Distributed Haskell Library 

Ports are represented as a polymorphic data type 

data Port a -- abstract 

where the type variable a represents the type of the values, that can be sent to 
the port. A new port can be created with the function 

newPort :: IO (Port a) 

Like in Concurrent Haskell, the operations for creating and sending have side 
effects. Hence they belong to the IO monad, in contrast to Eden [BLOMPM96] 
and Goffin [CGK98]. 

A value can be written to and read from a port with the functions 

writePort :: Port a -> a -> IO () 

readPort :: Port a -> IO a 

A port can be used in the same way as a channel in Concurrent Haskell, except 
that only the process which creates the port can read from it. 

To guarantee this by the type system we first wanted to distinguish read-write 
ports and write ports. Read-write ports can be converted into write ports 
and only write ports can be sent through other write ports. But sending a port 
to another process is not the only possibility for distributing ports in a process 
network. They can also be passed to other processes as parameters, when new 
processes are created with forkIO. Here we cannot avoid the distribution of read-write 
ports. Therefore we have abandoned this distinction. A runtime check 
is needed anyway. Hence, reading from a write port yields a runtime error. 

A requirement we stated above is that it does not matter where a port is located 
in the net. Sending to it should stay the same. In the case of a remote computer 
the messages have to be encoded in binary form. In this first implementation, as a library 
for the Glasgow Haskell Compiler [GHC], we send them as strings. Therefore the 
type of the messages which can be sent with writePort must be an instance 
of the class Show. On the other hand, a message must be reconverted into the 
corresponding data type if a process reads from a port. Therefore the type of 
the messages which can be received with readPort must be an instance of the 
class Read. This means the messages of a port need both instances. This is 
no problem in Haskell, because they can be derived for algebraic data types. But 
it is a restriction, because no functions, infinite data structures, or mutable 
structures like MVars can be sent through ports. 
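
The string encoding described above can be illustrated with derived Read and Show instances. The following is only a minimal sketch: the helper names encode and decode are hypothetical and not part of the library, and readMaybe from Text.Read is used here so that a malformed string yields Nothing instead of a runtime error.

```haskell
import Text.Read (readMaybe)

-- A message type in the style of the chat example; both instances are derived.
data ClientChatMsg = Chat String String | Login String | Logout String
  deriving (Read, Show, Eq)

-- Hypothetical helper: turn a message into the string sent over the wire.
encode :: Show a => a -> String
encode = show

-- Hypothetical helper: parse a received string, Nothing if it is malformed.
decode :: Read a => String -> Maybe a
decode = readMaybe
```

A value survives the round trip unchanged, while a string that does not denote a constructor of the expected type is rejected. Functions and MVars have no Read or Show instances, which is why they cannot be sent this way.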

At first sight these functions seem to be enough for programming distributed 
systems. But consider the development of a chat. When designing a chat 
system, one will program a chat server, which manages the clients taking part 
in the chat. New clients can join the chat and others can exit. But how can 
new clients connect to the chat server? For the connection of two independently 
started components we provide a global registration mechanism. With 

registerPort :: Port a -> PortName -> IO () 

unregisterPort :: Port a -> IO () 

ports can be globally registered and unregistered on one computer. Other processes 
can then look up a registered port with 

lookupPort :: PortHost -> PortName -> IO (Port a) 

from anywhere else in the net. In the current implementation PortHost and 
PortName are just type synonyms for String. But in later implementations 
we will also extend PortHost to IP addresses and allow Haskell programs with 
names, like nodes in Erlang. Then PortName will also contain the name of the 
Haskell program and ports in different Haskell programs can be registered with 
the same name on one computer. This can for example be useful, because two 
servers which register themselves globally with the same name can still be exe- 
cuted on the same computer. 

With these functions for registration and lookup of ports the chat server can 
register its in-port and the clients can look up this port. Their first message to 
the chat server can contain a port, which the client has created and to which 
the server sends further messages. 

But this example shows another problem. When a client has joined a chat, 
it sends new text messages to the server and this server broadcasts them to the 
other clients. For the client process this means that it has to read messages 
coming from the server and messages coming from the keyboard or another 
process managing the user interface. But there is no fixed order in which the 
client will receive these messages. Therefore it has to suspend on both. The 
easiest way would be that the keyboard process and the chat server use the same 
port and the client process reads from it. But from the software engineering view 
this is not nice, because the chat server and the keyboard would have to use messages of the 
same type to communicate with the client. Hence we need two different ports 
the client can suspend on. For this purpose Eden proposes a merge function which 
would in our setting have the type merge :: Port a -> Port a -> IO (Port a). 
But this function would not solve the problem above, because still both ports 
must be of the same type. Therefore we provide a merge function which allows 
a programmer to merge two ports of different types. We use Haskell's Either 
type: 

mergePort :: Port a -> Port b -> IO (Port (Either a b)) 
With this function the client can suspend on messages of different types from 
the chat server and the keyboard. It should also be possible that the merged 
ports can afterwards still be used in their unmerged versions, for example because the client 
wants to ignore messages from one of them. But the client is the only process 
that can read from these ports. Hence there can be no conflict in which two readers 
want to read from the ports and their merged version at the same time. 
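
The idea of merging two differently typed ports can be sketched with plain Concurrent Haskell channels. This is an illustration under assumptions, not the library's implementation: mergeChan and demo are hypothetical names, Chan stands in for Port, and two forwarder threads tag every message with Left or Right. Unlike the mergePort described above, this simple version consumes the source channels, so they cannot be read independently after merging.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Monad (forever)

-- Hypothetical sketch: merge two channels of different types into one.
mergeChan :: Chan a -> Chan b -> IO (Chan (Either a b))
mergeChan ca cb = do
  out <- newChan
  -- One forwarder per source channel; each tags its messages.
  _ <- forkIO (forever (readChan ca >>= writeChan out . Left))
  _ <- forkIO (forever (readChan cb >>= writeChan out . Right))
  return out

-- Small demonstration: one message from each source, read from the merged channel.
demo :: IO [Either Int String]
demo = do
  ca <- newChan
  cb <- newChan
  merged <- mergeChan ca cb
  writeChan ca 1
  x <- readChan merged
  writeChan cb "hi"
  y <- readChan merged
  return [x, y]
```

The reader of the merged channel can then pattern match on Left and Right, as the client process in the chat example does.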

For the creation, termination and error handling of processes we use the same 
functions as Concurrent Haskell and the module Exception: 

forkIO :: IO () -> IO ThreadId 

myThreadId :: IO ThreadId 

killThread :: ThreadId -> IO () 

raiseInThread :: ThreadId -> Exception -> IO () 

try :: IO a -> IO (Either Exception a) 

Finally we provide a linking mechanism. With the function 

linkAndKill :: Port a -> IO (Link) 

a link between the executing process and a port is established. If the port does not 
exist anymore, the process is terminated by an exception. This can be caught and 
the process can initiate a reorganization of the whole system. A more convenient 
function for linking ports is the function 

link :: Port a -> IO () -> IO (Link) 

which takes an additional IO action as parameter. This action is performed if 
the linked port does not exist anymore. With this function it is possible, for example, 
to send a message if a port dies. If an established link is not needed 
anymore, it can be removed with the function 

unlink :: Link -> IO () 

We have also added a fault tolerant version of writePort 

writePortFail :: Port a -> a -> IO () -> IO () 

Similar to link, the action in the third parameter is performed if 
sending fails. 
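
The behaviour of a write with a failure action can be approximated in a few lines. The sketch below is an assumption about how such a wrapper might look, not the library's code: writeWithFallback and demo are hypothetical names, the write is passed in as an arbitrary IO action, and catching SomeException is a simplification that would also intercept asynchronous exceptions.

```haskell
import Control.Exception (SomeException, throwIO, try)
import Data.IORef (modifyIORef, newIORef, readIORef)

-- Hypothetical sketch: run a write action; if it throws, run the failure action.
writeWithFallback :: IO () -> IO () -> IO ()
writeWithFallback write onFail = do
  r <- try write :: IO (Either SomeException ())
  case r of
    Left _   -> onFail
    Right () -> return ()

-- Demonstration: the first write succeeds, the second simulates a dead port.
demo :: IO [String]
demo = do
  logRef <- newIORef []
  let note s = modifyIORef logRef (++ [s])
  writeWithFallback (note "sent") (note "fallback")
  writeWithFallback (throwIO (userError "port is gone")) (note "fallback")
  readIORef logRef
```

In the chat server the failure action would typically send a Close message to the server's own port, as shown in the chat example.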

3.2 A Chat Example 

As an example we now present the implementation of a chat server. Its commu- 
nication interface is given by the data type 

data ServerMsg = Connect String (Port ClientChatMsg) 
               | Send String String 
               | Close String (Port ClientChatMsg) 
  deriving (Read, Show) 

The server creates a port for external communication in its initialization and 
registers this port as ChatServer. 
main = do serverPort <- newPort 
          registerPort serverPort "ChatServer" 
          chatServer serverPort [] 

After the initialization it proceeds in a loop, where it holds a list of the ports 
of all connected clients. This list changes as clients connect or close. 
A new chat message is broadcast to all the other connected clients: 

chatServer :: Port ServerMsg -> [Port ClientChatMsg] -> IO () 
chatServer serverPort clientPorts = do 
  msg <- readPort serverPort 
  case msg of 
    (Connect name clientPort) -> do 
      -- announce the new client to the clients that are already connected 
      mapM_ (\p -> writePort p (Login name)) clientPorts 
      chatServer serverPort (clientPort : clientPorts) 
    (Close name clientPort) -> do 
      let newClientPorts = filter (/= clientPort) clientPorts 
      mapM_ (\p -> writePort p (Logout name)) newClientPorts 
      chatServer serverPort newClientPorts 
    (Send name str) -> do 
      mapM_ (\p -> writePort p (Chat name str)) clientPorts 
      chatServer serverPort clientPorts 
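
The state handling of the server loop can be studied separately from the communication. The following pure model is a sketch under assumptions: PortId is a hypothetical stand-in for Port ClientChatMsg, and step is a hypothetical function that returns the new client list together with the messages to be delivered, treating Connect as a Login broadcast to the clients that are already connected.

```haskell
-- Hypothetical stand-in for a client port.
type PortId = Int

data ServerMsg = Connect String PortId
               | Send String String
               | Close String PortId

data ClientChatMsg = Chat String String | Login String | Logout String
  deriving (Eq, Show)

-- One step of the server loop as a pure function:
-- new client list plus the (recipient, message) pairs to send.
step :: [PortId] -> ServerMsg -> ([PortId], [(PortId, ClientChatMsg)])
step clients (Connect name p) =
  (p : clients, [(c, Login name) | c <- clients])
step clients (Close name p) =
  let rest = filter (/= p) clients
  in (rest, [(c, Logout name) | c <- rest])
step clients (Send name str) =
  (clients, [(c, Chat name str) | c <- clients])
```

For instance, a Send message leaves the client list unchanged and produces one Chat message per connected client, while Close removes the closing client before the Logout messages are generated.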

A client process uses the interface 

data ClientChatMsg = Chat String String | Login String | 
                     Logout String 
  deriving (Read, Show, Eq) 

to receive messages from the chat server. The Chat message is used for new chat 
messages. The first string is the nickname of the user taking part in the chat 
and the second string is her chat message. The Logout message is sent, if a client 
leaves the chat. For the interface to the keyboard process we just use strings. 
First a client process initiates the connection to a chat server on host. Then 
its own read ports are created and a process for the input from the keyboard is 
forked. 

main = do putStrLn "Host of chat server? " 
          host <- getLine 
          putStrLn "Nickname? " 
          name <- getLine 
          chatserverPort <- lookupPort host "ChatServer" 
          chatPort <- newPort 
          writePort chatserverPort (Connect name chatPort) 
          keyboardPort <- newPort 
          forkIO (readKeyBoard keyboardPort) 
          inPort <- mergePort chatPort keyboardPort 
          client name chatserverPort inPort chatPort 

The two read ports are merged into one port and the process proceeds in a loop. 
Here messages from the server are displayed and messages from the keyboard 
are forwarded to the server. 
client :: String -> Port ServerMsg -> Port (Either ClientChatMsg String) 
       -> Port ClientChatMsg -> IO () 
client name serverPort inPort chatPort = do 
  msg <- readPort inPort 
  case msg of 
    (Left serverMsg) -> do putStrLn (display serverMsg) 
                           client name serverPort inPort chatPort 
    (Right "")       -> writePort serverPort (Close name chatPort) 
    (Right str)      -> do writePort serverPort (Send name str) 
                           client name serverPort inPort chatPort 
  where display :: ClientChatMsg -> String 
        display (Chat name str) = name ++ " : " ++ str 
        display (Login name)    = name ++ " logged in" 
        display (Logout name)   = name ++ " logged out" 



The process for reading from the keyboard is not presented here. It just reads 
strings from the keyboard, sends them to the client process and terminates itself, 
if the user inputs the empty string. 

This chat application is not fault tolerant. If a client dies and 
the next chat message is broadcast to all chat clients, an exception is thrown 
and the chat server crashes, because the port of the dead client does not exist 
anymore. With the linking mechanism and the use of writePortFail instead of 
writePort we can easily guarantee fault tolerance for our server. In the case of 
a failure we just close the port which could not be written to. Therefore we just 
add the following linking to the Connect case in the server process: 

(Connect name clientPort) -> do 
  link clientPort (writePort serverPort (Close name clientPort)) 
  chatServer serverPort (clientPort : clientPorts) 

and modify the writePort instruction in the broadcast of chat messages: 

writePortFail p ... (writePort serverPort (Close p) ) ... 

The chat server just sends a Close message to itself if a port does not exist 
anymore. 

The example shows that it is easy to implement a client/server architecture 
like a chat in Distributed Haskell. Our practical experience shows that our 
communication model with only one reader per port is no real restriction in 
the development of arbitrary distributed systems. This is also confirmed by the 
success of Erlang, which has the same restriction. 

As an example, multiple readers seem to be useful for architectures like the 
producer/consumer problem. But this problem can easily be implemented in 
Distributed Haskell too. We add a process which simulates the store for the 
products. The producers insert products by sending messages to the store process. 
The consumers ask the store for new products and obtain them as 
messages from the store process. The system is divided into two client/server 
architectures. The advantage of this implementation is that the processes can 
be linked to the ports of the store and we can detect if one of the components 
(e.g. the store) crashes. In this case we can replace the store by a new process. 
Some products may be lost, but the system still works. Another possibility would 
be the definition of a second store process, which is (with some delay) identical 
to the other store and can replace it if it crashes. This implementation of 
fault tolerance seems to be expensive in this case, but we think this cannot be 
avoided. 
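
The store process for the producer/consumer problem can be sketched with local channels. This is a hedged illustration, not code from the library: StoreMsg, store and demo are hypothetical names, Chan stands in for Port, and a consumer passes its own reply channel in the Get request, in the same way a client would send its port to a server.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)

-- Requests understood by the store: insert a product, or ask for one.
-- The consumer's reply channel plays the role of its own port.
data StoreMsg a = Put a | Get (Chan (Maybe a))

-- The store is the only reader of its inbox and keeps the products as a queue.
store :: Chan (StoreMsg a) -> [a] -> IO ()
store inbox items = do
  msg <- readChan inbox
  case msg of
    Put x -> store inbox (items ++ [x])
    Get reply -> case items of
      []       -> writeChan reply Nothing  >> store inbox []
      (x : xs) -> writeChan reply (Just x) >> store inbox xs

-- Demonstration: two products are produced, three are requested.
demo :: IO [Maybe Int]
demo = do
  inbox <- newChan
  _ <- forkIO (store inbox [])
  reply <- newChan
  writeChan inbox (Put 1)
  writeChan inbox (Put 2)
  writeChan inbox (Get reply)
  a <- readChan reply
  writeChan inbox (Get reply)
  b <- readChan reply
  writeChan inbox (Get reply)
  c <- readChan reply
  return [a, b, c]
```

Every channel in this sketch has exactly one reader: the store reads its inbox and each consumer reads its own reply channel. Answering Nothing for an empty store is a design choice of the sketch; a store could equally well queue the request until a product arrives.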

4 Implementation 

The implementation is a library, which can be imported in any Haskell program. 
The whole library is written in Concurrent Haskell. 

In our port based extension we have to handle two different kinds of communi- 
cation through ports. For the internal communication inside a Haskell program, 
which consists of multiple concurrent processes, we want to use communication 
through channels from Concurrent Haskell. For the external communication be- 
tween different Haskell programs, we want to use communication via TCP/IP. 
Therefore we use the GHC library Socket. 

We also allow multiple Haskell programs on one computer. Hence distributed 
systems can also be developed on a single computer. No network of multiple 
computers is needed. Therefore on every computer an external post office is 
started. This external post office listens on a fixed socket. All started Distributed 
Haskell programs register themselves here and obtain a program number and a 
free socket number on which they listen for external messages. Global registrations 
of ports in the Distributed Haskell programs are recorded in this external post 
office and other processes can look up the corresponding ports here. 

The use of ports is supposed to be transparent, independent of the host a 
writing process is located on. Therefore the internal representation contains the 
IP-Address of the host, where the port is located and the program number of 
the Haskell program it is created in. Therefore messages that are written to the 
port from outside can be directed to the correct Haskell program. We want to 
guarantee, that only the process which creates a port can read from it. Hence a 
port also contains the ThreadID of its creating process. If another process tries 
to read from it, then a runtime exception is thrown. 
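
The ownership check described above can be sketched for the purely local case. The following is a simplified model under assumptions, not the library's actual representation: the record fields portOwner and portChan are hypothetical, the host address and program number are omitted, and the runtime error is modelled as an IOException created with userError.

```haskell
import Control.Concurrent (ThreadId, forkIO, myThreadId)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (IOException, throwIO, try)

-- Hypothetical local model of a port: the creator's ThreadId plus a typed channel.
data Port a = Port { portOwner :: ThreadId, portChan :: Chan a }

newPort :: IO (Port a)
newPort = do
  tid <- myThreadId
  c <- newChan
  return (Port tid c)

-- Any thread may write to a port.
writePort :: Port a -> a -> IO ()
writePort p = writeChan (portChan p)

-- Only the creating thread may read; other threads get a runtime exception.
readPort :: Port a -> IO a
readPort p = do
  tid <- myThreadId
  if tid == portOwner p
    then readChan (portChan p)
    else throwIO (userError "readPort: only the creating thread may read")

-- Demonstration: a forked thread may write but its read attempt is rejected.
demo :: IO (Int, Bool)
demo = do
  p <- newPort
  done <- newEmptyMVar
  _ <- forkIO $ do
    writePort p 42
    r <- try (readPort p) :: IO (Either IOException Int)
    putMVar done (either (const True) (const False) r)
  v <- readPort p
  rejected <- takeMVar done
  return (v, rejected)
```

The creating thread receives the value written by the forked thread, while the forked thread's own read attempt fails before it touches the channel.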

For the internal communication through ports we use channels. This is effi- 
cient and even lazy. For external communication we cannot communicate through 
channels. The values which are sent through a port need a representation as a 
sequence of bytes. We use a string representation. Values can be converted from 
and into this representation with read and show. For algebraic data types, which 
do not contain function types, these functions can easily be derived in Haskell. 
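The conversion with show and read can be illustrated with a small sketch. The message type ChatMsg and the helper names encode and decode are hypothetical and are not part of the library; only the derived Read and Show instances are taken from the text.

```haskell
-- Hypothetical message type for illustration. It contains no function
-- types, so Read and Show can be derived.
data ChatMsg
  = Join String
  | Say String String
  | Leave String
  deriving (Eq, Read, Show)

-- Convert a value into its string representation for the network.
encode :: Show a => a -> String
encode = show

-- Parse a received string; Nothing if it is not exactly one valid value.
decode :: Read a => String -> Maybe a
decode s = case reads s of
  [(v, rest)] | all (`elem` " \t\n") rest -> Just v
  _                                       -> Nothing
```

For any such message m, decode (encode m) gives back Just m, while a malformed string is rejected instead of raising an exception as plain read would.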

All external communication is sent through the external post office. Therefore 
all newly created processes are registered in the external post office. They also 
obtain a socket number, on which all messages from external processes arrive. 
When a new port is created a process is forked, which listens on this socket 
and forwards all incoming messages to the typed channel representing the port. 
When this process is created the type of the port is known. The conversion of
incoming strings into typed values with read is fixed for the lifetime of the port
and this process. So all messages which are sent to a port arrive in the typed
channel, independently of their origin. The function readPort only tests whether
the reading thread has created the port and then reads from this typed channel
with readChan.
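The ownership test can be sketched as follows. This is a simplified, purely local model, not the library's actual representation: the record fields, the error message and the helper foreignReadRejected are assumptions made for illustration, and the network components of a port are omitted.

```haskell
import Control.Concurrent (ThreadId, forkIO, myThreadId)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (IOException, try)

-- Simplified local port: the typed channel plus the creator's thread id.
data Port a = Port
  { portOwner :: ThreadId
  , portChan  :: Chan a
  }

newPort :: IO (Port a)
newPort = do
  tid <- myThreadId
  ch  <- newChan
  return (Port tid ch)

-- Any thread may write to a port.
writePort :: Port a -> a -> IO ()
writePort p = writeChan (portChan p)

-- Only the creating thread may read; other threads get an exception.
readPort :: Port a -> IO a
readPort p = do
  me <- myThreadId
  if me == portOwner p
    then readChan (portChan p)
    else ioError (userError "readPort: caller did not create this port")

-- Returns True if a read attempted from a freshly forked thread is rejected.
foreignReadRejected :: Port Int -> IO Bool
foreignReadRejected p = do
  done <- newEmptyMVar
  _ <- forkIO $ do
    r <- try (readPort p) :: IO (Either IOException Int)
    putMVar done (either (const True) (const False) r)
  takeMVar done
```

The check happens before the channel is touched, so a foreign reader fails immediately rather than blocking on an empty port.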

For communication via ports it is also necessary that ports can be sent
through external ports. This means that they also need a string representation
and are instances of the classes Read and Show. As long as a port is just
sent through the typed channel representing a port (internal communication), its
representation stays unchanged. Only if it is sent through the net do we lose the
typed parts. The string representation only contains the ThreadID, the program
identifier, the IP address of the host, and the socket number. If a port is sent
through the net and returns to the program where it is physically located, it
is not possible to access its typed version again, because of the Haskell type
system. We would need a data structure which can contain ports of arbitrary
types to get back the typed representation of the port. As an optimization we
avoid communication via sockets in this case. For every port we have added a
string channel, through which this kind of internal communication is executed.
Again a special process forwards the messages from this string channel into
the typed channel. This process is created in the function newPort. The string
channel can be stored in a database. When a port returns to its origin, this
string channel can be looked up in the database and added to its representation
again. This lookup is a side effect. It is implemented with unsafePerformIO, but
it behaves transparently to the application, because every port will only obtain
one corresponding representation. The whole structure of a port and the internal
and external communication is summarized in Figure 2.




Fig. 2. Structure of a port and internal/external communication 



The two conversion processes for each port, which convert the string
messages into typed messages with read, have a disadvantage: they impede garbage
collection of a port. They always hold a reference to the string channel and
the typed channel of a port. Therefore these port components never become
garbage. As a solution we have implemented a function destroyPort :: Port
t -> IO (), which terminates the conversion processes and removes all entries
of the port in the internal and external post office. When a new port is created
we bind this function as a finalizer to the port. Hence if no reference to the
port exists anymore, the conversion processes are killed and all references to
its channels are eliminated. So the channels become garbage too.

The last internal representation of a port is the merged port, which is a
read-only port. Both merged ports are internally represented by channels. If a
process wants to read from a merged port, two processes are created, which
suspend on these two channels. If one of them receives a message, this message
is tagged with the constructor Left or Right, respectively, and transferred to
an MVar representing the merged port. After that both processes are terminated
and the original ports can be read again. The only problem here is that both
processes can read their channels simultaneously. We would have to guarantee
mutual exclusion here, but this is impossible without busy waiting. Hence in
this case we write one of the received messages back into the corresponding
channel using the function unGetChan.
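The idea of tagging messages from two differently typed sources with Left and Right can be sketched in a simplified variant. Unlike the implementation described above, this sketch keeps two permanent forwarding threads and a merged channel instead of creating and killing the readers on every read, so it needs no unGetChan; the name mergeChans is hypothetical.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Monad (forever)

-- Merge two channels of different types into one channel of tagged values.
-- After merging, the source channels should only be read by the forwarders.
mergeChans :: Chan a -> Chan b -> IO (Chan (Either a b))
mergeChans ca cb = do
  merged <- newChan
  _ <- forkIO (forever (readChan ca >>= writeChan merged . Left))
  _ <- forkIO (forever (readChan cb >>= writeChan merged . Right))
  return merged
```

A reader of the merged channel can then branch on Left and Right with an ordinary case expression, which is what allows the two sources to carry different message types.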

The function register registers the string channel component of the port 
at the external post office. With lookupPort this string channel can then be 
accessed from anywhere in the net. 

For the process manipulation we just use the functions of Concurrent Haskell. 

Finally we provide a polling mechanism for the linking of ports. All linked
ports are stored in a database. A background process polls all registered
ports on a fixed schedule. If one of these ports does not exist anymore, an
exception is thrown (linkAndKill) or the specified IO action is performed (link,
writePortFail). This could be programmed by hand by the programmer, but
the provided functions make it more convenient.



5 Related Work 

There are many other approaches to the extension of functional programming
languages. We compare the main ones with our approach:

- Goffin [CGK98] extends Haskell with concurrent constraint programming
and a special port concept for internal and external communication. The ports
are not integrated in the IO monad. Nondeterminism and input/output are no
longer encapsulated in the IO monad. Furthermore a user has to learn about
concurrent constraint programming, another programming paradigm.

Goffin does not restrict the number of readers and writers of a port. As we
described in Sect. 2, multiple readers of a port can cause problems when
implementing fault tolerant systems. An implementation of this concept seems
difficult. This is confirmed by the fact that an executable implementation of
Goffin does not exist yet.

For reacting on multiple ports Goffin proposes a fair merge :: Port a ->
Port a -> Port a for ports. With this function a process can wait for different
messages on different ports and branch depending on the incoming message.
But both merged ports must have the same type. This restriction is too strong,
as we have seen in the chat example.

- Eden [BLOMPM96] is an extension of Haskell for concurrent and parallel
programming. A process concept is added in which every process has a fixed
number of input and output channels for communication with other processes.
Communication is not integrated in the IO monad, and with a fair merge, which
is part of Eden, processes can behave nondeterministically. As in Goffin, this
merge is restricted to channels of the same type.

Processes can suspend on their different input channels. Messages in other
channels are buffered automatically. To react to more than one input channel, it
is possible to merge channels, with the same restrictions as in Goffin.
Furthermore, in Eden a process can only read from or write to a fixed number of
channels. The connections between the processes cannot be changed dynamically.

Eden is developed for parallel programming, where programs have a more
hierarchical structure than in distributed programming, and it is difficult to
implement complex protocols in Eden. It is also not possible to connect two
independently started processes in Eden, although this is needed for many
distributed applications.

- Curry [Han99] is a functional-logic programming language, which extends
Haskell with needed narrowing, residuation, and encapsulated search. For
distributed programming Curry adds named ports, which guarantee that all readers
of one port are executed in the same Curry program. This is intended to
eliminate the implementation problems with multiple readers. On the other hand,
a programmer can also send logical variables through ports, which is proposed
as an easy answering mechanism. But with these logical variables, channels with
multiple readers can be programmed. This results in the same problems with
multiple readers that the introduction of ports was meant to avoid. The
problems are also reflected in the fact that no current Curry implementation
provides unrestricted sending of logical variables through a net. Logical
variables can only be used as convenient answer variables. This restriction does
not come from the formal semantics, but from the implementation.

Communication in Curry is, as in Goffin, a constraint which has to be solved.
The IO monad is used only for external communication. Concurrent processes
communicate via lazy streams. This can result in problems with laziness, and
strictness annotations sometimes have to be added. Another problem is that a
concurrent application cannot easily be distributed over a network, because the
processes have to be transferred into the IO monad. This can cause problems with
the scalability of a system.

- Glasgow distributed Haskell [PTL00] is an approach for the integration
of Glasgow parallel Haskell [THM+96] and Concurrent Haskell. It provides
closed distributed systems and communication between processes as in
Concurrent Haskell. The main idea is the distribution of a shared memory system.
Communication between processes is not strict. Hence the programmer does not
know when data is exchanged between the components of the network and cannot
estimate when and how much net traffic is produced. But this knowledge is
necessary for the development of distributed applications.

Fault tolerance is restricted to error handling as in Concurrent Haskell. If
one of the computers in the network crashes, then the whole system crashes
too. But the greatest disadvantage of this approach is the restriction to closed
systems, which makes it impossible to implement many distributed applications,
such as (mobile) telephony.

- Facile [TLK96] is an extension of Standard ML for distributed programming.
It provides channels with multiple readers and writers in a network. It also
provides open distributed systems and a registration mechanism similar to ours.
But Facile does not provide a linking concept for fault tolerant programming,
which is needed for the development of distributed applications. It is only
possible to program with timeouts. Finally, Facile is a strict programming
language and the communication is implemented as a side effect. We have
integrated the communication in the IO monad and hence preserve referential
transparency.

- Finally, we once again compare our approach with Erlang [AWV93]. The
first advantage of Distributed Haskell is that messages are statically typed. The
type of the messages which can be sent through a port is the communication
interface of the process which reads from the port. In Erlang such an interface
does not exist. Furthermore this type system provides safety in program
development. For example, typing mistakes yield a compile-time error in
Distributed Haskell, not a deadlock as in Erlang. Another advantage is our
linking mechanism, which is more powerful than linking in Erlang. We can add
arbitrary IO actions to the links, which are performed if the linked port dies.
In Erlang it is only possible to receive a message if another process dies. The
process which receives this message must interrupt its execution, react to this
message, and afterwards resume its work. Performing different actions in case of
different exceptions is much more difficult than with our linking mechanism.

6 Conclusion and Future Work 

We have extended Haskell for concurrent and distributed programming. With the
example of a chat we have shown how easily distributed systems can be developed
using Distributed Haskell. The main concept of Distributed Haskell is the use of
ports. Ports differ from channels in Concurrent Haskell in that only the
process which created a port can read from it. This restriction is needed
for the implementation, especially when we want to develop robust and fault
tolerant systems. Another advantage of this restriction is that ports which
are not used anymore are detected as garbage and get collected automatically.
All described extensions are implemented using the Glasgow Haskell Compiler
[GHC] with the libraries Concurrent and Socket. Together they form the library
Port, which can be used with the Glasgow Haskell Compiler [GHC].

In future work we want to implement lazy communication through the net.
This means sending fragments of the heap instead of sending their string
representation. With this extension we would also be able to send infinite data
structures and functions. Furthermore we want to investigate whether this is
sensible, because calculations can be duplicated. Therefore we want to examine
how well distributed programming and lazy evaluation match.

Abstract. This paper provides a self-contained formal description of the
dynamic properties of Hume, a novel functionally-based concurrent lan- 
guage that aims to target space- and time-critical systems such as safety- 
critical, embedded and real-time systems. The language is designed to 
support rigorous cost and space analyses, whilst providing a high level 
of abstraction including polymorphic type inference, automatic mem- 
ory management, higher-order functions, exception-handling and a good 
range of primitive types. 



1 Introduction 

Hume (Higher-order Unified Meta-Environment) is a polymorphically typed 
functionally-based language for developing, proving and assessing concurrent, 
time- and space-critical systems, including embedded and real-time systems 
131TO . The language is intended to give strong guarantees of bounded time 
and space behaviour, whilst providing a relatively high level of expressive power 
through the use of functional language features. It has thus been designed to 
support the fundamental requirements of safety-critical software, as espoused by 
e.g. Leveson [8], whilst raising the level of abstraction at which that software can 
be implemented. Increased abstraction brings the usual advantages of faster de- 
velopment, reduced costs, and the complete elimination of certain kinds of error. 
In the safety-critical arena, however, such abstraction must be limited by the 
need for transparent implementation: high-level constructs must have obvious 
analogues in the low-level implementation, for example. This has traditionally 
been a weakness of functional language implementations. 

A primary goal of the Hume design is to match sound formal program devel- 
opment practice by improving confidence in the correctness of implementation. 
Typical formal approaches to designing safety-critical systems progress rigor- 
ously from requirements specification to systems prototyping. Languages and 
notations for specification/prototyping provide good formalisms and proof sup- 
port, but are often weak on essential support for programming abstractions, such 
as data structures and recursion. Implementation therefore usually proceeds less 
formally, or more tediously, using conventional languages and techniques. Hume 
is intended to simplify this process by allowing more direct implementation of
the abstractions provided by formal specification languages. Alternatively, in
a less formal development process, it can be used to give a higher-level, more 
intuitive implementation of a real-time problem. 

This paper provides a high-level formal description of the behaviour of con- 
current Hume programs, incorporating a simple integrated model of time usage. 
This description is intended to formalise and explicate the language design, for 
the benefit of both users and implementors, but is not primarily intended to be 
used as a basis for direct proof. 

2 The Hume Language 

Hume has a 3-level structure: the coordination layer, based on communicating 
(parallel) processes, encompasses the inner, purely functional, expression layer. 
These two layers are enclosed in a static declaration layer which introduces type, 
function, value, exception etc. definitions that can be used in either or both 
dynamic layers. Exceptions may be raised in expressions, but are handled in 
the coordination layer. Strict cost mechanisms ensure that expressions produce 
results in bounded time and space, and that exception handling costs are also 
bounded. 

2.1 The Hume Expression Layer 

The Hume expression layer is a purely functional language with a strict seman- 
tics. It is intended to be used for the description of single, single-shot, non- 
reentrant processes. It is deterministic, and has statically bounded time and 
space behaviour. In order to ensure these strong properties, expressions whose 
time or space cost is not explicitly bounded must be restricted to a statically- 
checkable primitive recursive form m- 

2.2 The Hume Coordination Layer 

The Hume coordination layer is a finite state language for the description of
multiple, interacting, re-entrant processes built from the purely functional ex- 
pression layer. The coordination layer is designed to have statically provable 
properties that include both process equivalence and safety properties such as 
the absence of deadlock, livelock or resource starvation. 

The basic unit of coordination is the box (Figure 1), an abstract notion of a
process that specifies the links between its input and output channels in terms
of functional pattern matches, and which provides exception handling facilities
including timeouts and system exceptions, with handlers defined as pattern
matches on exception values. The coordination layer is responsible for interaction
with external, imperative state through streams and ports that are ultimately
connected to external devices. Our initial design allows the definition of
simple, static process networks only, using a static wiring notation (Figure 2).
Only values with statically determined sizes may be communicated through wires.

<boxdecl>   ::= "box" <boxid>
                "in"  <in1>  "," ... "," <inn>
                "out" <out1> "," ... "," <outn>
                "match" <matches>
                [ "timeout" <expr> ]
                [ "handle" <handlers> ]

<in>/<out>  ::= <varid> "::" <exprtype>

<matches>   ::= <match1> "|" ... "|" <matchn>        n >= 1

<match>     ::= <patt> "->" <expr>

Fig. 1. Syntax of boxes

<wiredecl>        ::= "wire" <boxid> <sources> <dests>

<sources>/<dests> ::= "(" <link1> ... <linkn> ")"    n >= 0

<link>            ::= <boxid> "." <varid> | <streamid>

Fig. 2. Syntax of wires



2.3 Types 

Hume is a polymorphically typed language in the spirit of ML m or Haskell ng. 
It supports a wide range of scalar types, including booleans, characters, variable 
sized word values, fixed-precision integers (including natural numbers), floating- 
point values, and fixed-exponent real numbers. Precisely costed conversions are 
defined between values of those types and the sizes of all scalar types other than 
booleans must be specified precisely. 

Hume also supports four kinds of structured type: vectors, lists, tuples and 
user-defined constructed types. Vector and tuple types are fixed size, whereas 
lists and user-defined types may be arbitrary sized. All elements of a single vector 
or list must have the same type. 



2.4 Example: Railway Junction 

As an example of a basic Hume program, we show a simple controller for a simple
railway junction (Figure 3), comprising the junction of two tracks, t1 and t2,
into the single track t3 through the point p. Each incoming track is controlled
by a signal (s1 or s2). The controller avoids collisions by ensuring that at
most one signal is green and that both signals are red unless t3 is empty. The
tracks and signals are modelled as Hume streams.

data Maybe a   = Just a | Nothing;
data Direction = Left | Right;
data Signal    = Red | Green;
type Point     = Maybe Direction;
type Speed     = Int;
type Train     = Maybe Speed;

left  = (Green, Red, Just Left);
right = (Red, Green, Just Right);

box junction
in  (sense1, sense2, sense3 :: Train)
out (sig1, sig2 :: Signal, point :: Point)
match
    (_, _, Just _)                -> (Red, Red, Nothing)
  | (Just _, Nothing, Nothing)    -> left
  | (Nothing, Just _, Nothing)    -> right
  | (Just sp1, Just sp2, Nothing) -> if sp1 > sp2 then left else right;

wire junction (t1,t2,t3) (s1,s2,p);
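For readers more familiar with Haskell, the match part of the junction box can be mirrored as an ordinary function. This is only a sketch of a single firing of the box, not of Hume's coordination semantics. It assumes that the third rule is meant to cover a train on the second track only, renames the Direction constructors to DLeft and DRight to avoid a clash with Haskell's Either, and returns Nothing when no rule matches; all of these are choices made for this illustration.

```haskell
data Direction = DLeft | DRight deriving (Eq, Show)
data Signal    = Red | Green    deriving (Eq, Show)

type Point = Maybe Direction
type Speed = Int
type Train = Maybe Speed

left, right :: (Signal, Signal, Point)
left  = (Green, Red, Just DLeft)
right = (Red, Green, Just DRight)

-- One firing of the junction box; Nothing when no rule matches.
junction :: (Train, Train, Train) -> Maybe (Signal, Signal, Point)
junction (_, _, Just _)                = Just (Red, Red, Nothing)
junction (Just _, Nothing, Nothing)    = Just left
junction (Nothing, Just _, Nothing)    = Just right
junction (Just sp1, Just sp2, Nothing) = Just (if sp1 > sp2 then left else right)
junction (Nothing, Nothing, Nothing)   = Nothing
```

In every result at most one of the two signals is Green, and both are Red whenever the third track is occupied, which is the safety property stated above.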



3 Costing Hume Programs 

This section gives a cost semantics for Hume programs, which will subsequently
be integrated with a dynamic semantics in Section 4. The rules use a sequent
style similar to that used for the semantics of Standard ML m, but with a 
number of technical differences aimed at a more intuitive language description. 

For example, $E \vdash \mathit{exp} \stackrel{cost}{\Rightarrow} c$ means that under environment E, exp has cost
c. The definitions of the cost domains and environments are given below. Cost
environments (CostEnv) map variable identifiers to cost values. Arithmetic
operations +, max etc. are defined on cost expressions in the obvious way as for
normal integers, with the addition of an infinite cost value $\infty$, which is treated
as in normal integer arithmetic.

\[ \begin{array}{rcll}
E \in \mathit{Env} & = & \langle \mathit{VarEnv}, \mathit{CostEnv}, \mathit{SysEnv} \rangle & \text{Environments} \\
CE \in \mathit{CostEnv} & = & \{ \mathit{var} \mapsto \mathit{Cost} \} & \text{Cost Environments} \\
c, t \in \mathit{Cost} & = & 0, 1, \ldots, \infty & \text{Cost Values}
\end{array} \]

Figures 4-7 give rules to statically derive an upper bound on the time cost
for Hume programs. The rules are a simple big-step operational semantics, with 
extensions to timeouts and exceptions. We have deliberately used a very simple 
cost semantics here rather than the more general form that we are developing as 
part of our work in parallel computation jH] . The latter is intended for arbitrary 
recursion and requires the use of a constraint solving engine, whereas Hume 
is restricted to function definitions that are either primitive recursive or which 
have explicit time and space constraints. The rules given here are also restricted 
to first-order functions. Although expressed in terms of time cost, it would be 
straightforward to modify these rules to cover space costs. 



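\[ \boxed{E \vdash \mathit{body} \stackrel{cost}{\Rightarrow} \mathit{Cost}} \]

\[ \frac{E \vdash \mathit{matches} \stackrel{cost}{\Rightarrow} c, c' \qquad E \vdash \mathit{handlers} \stackrel{cost}{\Rightarrow} c'' \qquad c \neq \infty \wedge c' \neq \infty}
        {E \vdash \mathit{matches}\ \mathtt{handle}\ \mathit{handlers} \stackrel{cost}{\Rightarrow} c + c' + c''} \]

\[ \frac{E \vdash \mathit{time} \stackrel{cost}{\Rightarrow} t \qquad E \vdash \mathit{matches} \stackrel{cost}{\Rightarrow} c, c' \qquad E \vdash \mathit{handlers} \stackrel{cost}{\Rightarrow} c''}
        {E \vdash \mathit{matches}\ \mathtt{timeout}\ \mathit{time}\ \mathtt{handle}\ \mathit{handlers} \stackrel{cost}{\Rightarrow} \min(t, c + c') + c''} \]

Fig. 4. Cost axioms for boxes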



Figure 5 gives simplified cost rules for expressions taken from [2]. The cost of
a function application is the cost of evaluating the body of the function plus the 
cost of each argument. The cost of building a new constructor value such as a 
tuple or a user-defined constructed type is the cost of evaluating the arguments 
to that constructor plus one for each argument (representing the cost of building 
each heap cell). An additional cost is added to user-defined constructor values 
to represent the cost of building a cell to hold the constructor tag. The cost of 
raising an exception is the cost of the enclosed expression plus one (representing 
the cost of actually throwing that exception). Finally the cost of an expression 
enclosed within a timeout is the minimum of the expression cost and the specified 
timeout (a static constant). 
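The expression cost rules described above can be read as a small recursive function. The following is a sketch under several assumptions: the data types Cost and Expr and all constructor names are invented for this illustration, the timeout of a within-expression is given directly as a constant, functions missing from the cost environment are treated as having no bound, and tuples are costed with the same n + 1 term as user-defined constructors.

```haskell
-- A cost is a finite number of time units or infinity.
-- The derived Ord instance orders every Fin value below Inf.
data Cost = Fin Int | Inf deriving (Eq, Ord, Show)

plus :: Cost -> Cost -> Cost
plus (Fin a) (Fin b) = Fin (a + b)
plus _       _       = Inf

-- Hypothetical first-order expression syntax.
data Expr
  = Var String
  | App String [Expr]   -- function application
  | Con String [Expr]   -- constructor or tuple
  | If Expr Expr Expr
  | Raise String Expr
  | Within Expr Cost    -- expression guarded by a static timeout

-- Cost environment: function identifiers mapped to the cost of their body.
type CostEnv = [(String, Cost)]

cost :: CostEnv -> Expr -> Cost
cost _  (Var _)      = Fin 1
cost ce (App f args) = foldr plus body (map (cost ce) args)
  where body = maybe Inf id (lookup f ce)  -- assumption: unknown means unbounded
cost ce (Con _ args) = foldr plus (Fin (length args + 1)) (map (cost ce) args)
cost ce (If c t e)   = cost ce c `plus` max (cost ce t) (cost ce e)
cost ce (Raise _ e)  = cost ce e `plus` Fin 1
cost ce (Within e t) = min (cost ce e) t
```

With these definitions a triple of three nullary constructors, such as (Red, Red, Nothing), costs 3 + 3 + 1 = 7 units, and wrapping any expression in Within caps its cost at the timeout.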

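\[ \boxed{E \vdash \mathit{exp} \stackrel{cost}{\Rightarrow} \mathit{Cost}} \]

\[ \frac{}{E \vdash \mathit{var} \stackrel{cost}{\Rightarrow} 1} \]

\[ \frac{(CE\ \mathit{of}\ E)(\mathit{var}) = c \qquad \forall i.\ 1 \le i \le n,\ E \vdash \mathit{exp}_i \stackrel{cost}{\Rightarrow} c_i}
        {E \vdash \mathit{var}\ \mathit{exp}_1 \ldots \mathit{exp}_n \stackrel{cost}{\Rightarrow} \sum_{i=1}^{n} c_i + c} \quad n > 0 \]

\[ \frac{\forall i.\ 1 \le i \le n,\ E \vdash \mathit{exp}_i \stackrel{cost}{\Rightarrow} c_i}
        {E \vdash \mathit{con}\ \mathit{exp}_1 \ldots \mathit{exp}_n \stackrel{cost}{\Rightarrow} \sum_{i=1}^{n} c_i + n + 1} \]

\[ \frac{E \vdash \mathit{exp}_1 \stackrel{cost}{\Rightarrow} c_1 \qquad E \vdash \mathit{exp}_2 \stackrel{cost}{\Rightarrow} c_2 \qquad E \vdash \mathit{exp}_3 \stackrel{cost}{\Rightarrow} c_3}
        {E \vdash \mathtt{if}\ \mathit{exp}_1\ \mathtt{then}\ \mathit{exp}_2\ \mathtt{else}\ \mathit{exp}_3 \stackrel{cost}{\Rightarrow} c_1 + \max(c_2, c_3)} \]

\[ \frac{E \vdash \mathit{decls} \Rightarrow E', c \qquad E \vec{\oplus} E' \vdash \mathit{exp} \stackrel{cost}{\Rightarrow} c'}
        {E \vdash \mathtt{let}\ \mathit{decls}\ \mathtt{in}\ \mathit{exp} \stackrel{cost}{\Rightarrow} c + c'} \]

\[ \frac{E \vdash \mathit{exp} \stackrel{cost}{\Rightarrow} c}
        {E \vdash \mathtt{raise}\ \mathit{exnid}\ \mathit{exp} \stackrel{cost}{\Rightarrow} c + 1} \]

\[ \frac{E \vdash \mathit{exp}_1 \stackrel{cost}{\Rightarrow} c \qquad E \vdash \mathit{exp}_2 \Rightarrow t}
        {E \vdash \mathit{exp}_1\ \mathtt{within}\ \mathit{exp}_2 \stackrel{cost}{\Rightarrow} \min(c, t)} \]

Fig. 5. Cost axioms for expressions

Figure 6 gives costs for pattern matches. Two values are returned from a
match sequence, which are summed to give the overall cost of the match
sequence. The first cost is the total cost of matching all the patterns in the
match sequence. This places an upper bound on the match cost. The second cost is
the cost of evaluating the most expensive right-hand-side. This places an upper
bound on the expression cost. The cost of matching multiple patterns in a single
match is the sum of matching each pattern. Wildcard patterns (_) cost nothing
to match, whereas normal variables have a match cost (here specified as one).
The cost of matching a constructor pattern is the cost of matching all subsidiary
patterns plus one for the constructor itself.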
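The pattern costs just described translate directly into a recursive function. The data type Pat and its constructor names are hypothetical; tuple patterns are treated like constructor patterns, which is an assumption of this sketch.

```haskell
-- Hypothetical pattern syntax.
data Pat
  = PWild              -- wildcard pattern _
  | PVar String        -- ordinary variable
  | PCon String [Pat]  -- constructor (or tuple) with sub-patterns

-- Cost of matching a single pattern.
patCost :: Pat -> Int
patCost PWild       = 0
patCost (PVar _)    = 1
patCost (PCon _ ps) = 1 + sum (map patCost ps)

-- Cost of matching several patterns in one match: the sum of the parts.
patsCost :: [Pat] -> Int
patsCost = sum . map patCost
```

For example, a pattern of the shape (_, _, Just _) costs 2: one unit for the tuple, one for Just, and nothing for the three wildcards.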

Figure 7 gives the cost of matching exceptions. Since only one handler can
ever be matched, the upper bound cost of a handler sequence is the greatest cost 
of any exception handler. The cost of an individual handler is the cost of the 
match plus 1 for returning the (fixed-cost) expression that is its right-hand-side. 

3.1 Costing the Railway Junction Example 

Fig. 6. Cost axioms for pattern matches

\[ \boxed{E \vdash \mathit{handlers} \stackrel{cost}{\Rightarrow} \mathit{Cost}} \]

\[ \frac{E \vdash \mathit{pat} \stackrel{cost}{\Rightarrow} c \qquad E \vdash \mathit{exp} \stackrel{cost}{\Rightarrow} c' \qquad c' \neq \infty}
        {E \vdash \{ \mathit{exnid}\ \mathit{pat} \rightarrow \mathit{exp} \} \stackrel{cost}{\Rightarrow} c + 1} \]

\[ \frac{E \vdash \mathit{handler} \stackrel{cost}{\Rightarrow} c \qquad E \vdash \mathit{handlers} \stackrel{cost}{\Rightarrow} c'}
        {E \vdash \mathit{handler} \mid \mathit{handlers} \stackrel{cost}{\Rightarrow} \max(c, c')} \]

Fig. 7. Cost axioms for exception handlers

As an example of using the cost rules, we return to the railway junction example
from Section 2.4. The cost of the junction box is defined as the sum of the
costs of the left-hand-sides of the rules, plus the maximum cost of the
right-hand-sides. The cost of each match can be derived using the match rules,
as shown in the table below. For convenience, the costs for each match are split
into their left-hand- and right-hand-side components.



Rule    Cost_LHS (c_l)    Cost_RHS (c_r)
1       2                 7
2       4                 9
3       4                 9
4       6                 13

The worst case cost is thus $\sum_{i=1}^{4} c_{l_i} + \max_{i=1}^{4} c_{r_i}$. This is 16 + 13, or 29 time
units.
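The arithmetic can be checked with a few lines of code. The list ruleCosts simply restates the table above; the function name matchSeqCost is hypothetical.

```haskell
-- (left-hand-side cost, right-hand-side cost) for rules 1 to 4.
ruleCosts :: [(Int, Int)]
ruleCosts = [(2, 7), (4, 9), (4, 9), (6, 13)]

-- Upper bound for a non-empty match sequence: every pattern may have to be
-- tried, but only one right-hand side is ever evaluated.
matchSeqCost :: [(Int, Int)] -> Int
matchSeqCost rules = sum (map fst rules) + maximum (map snd rules)
```

Here matchSeqCost ruleCosts evaluates to 16 + 13 = 29, in agreement with the figure derived by hand.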

4 Dynamic Semantics 

This section gives a dynamic semantics for Hume, focusing on the coordination 
layer. We restrict our attention to a subset of the Hume declaration and expres- 
sion layers, ignoring lists and vectors, for example. The full dynamic semantics 
is given in the language report m- 

The semantics is given in terms of the semantic domain of values SemVal,
defined below. The notation $\langle \ldots \rangle$ is used for semantic tuples in the SemVal
domain. The notation $D^*$ is the domain of all tuples of D: $\langle\rangle$, $\langle D \rangle$, $\langle D, D \rangle$, ... We
use subscripts to select tuple components, e.g. $v_i$. For any tuple v, the notation
fst(v) is equivalent to $v_1$ and snd(v) is equivalent to $v_2$.

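\[ \begin{array}{rcll}
\mathit{BasVal} & = & \{ \mathit{PrimPlusInt}, \mathit{PrimEqInt}, \ldots \} & \text{Basic Values} \\
\mathit{BasCon} & = & \{ \mathit{True}, \mathit{False}, \ldots \} & \text{Basic Constructors} \\
\mathit{Con} & = & \mathit{BasCon} + \mathit{con} & \\
v, vs \in \mathit{SemVal} & = & \mathit{BasVal} + \mathit{Con}\ \mathit{SemVal}^* + \mathit{SemVal}^* + \mathit{Exn} & \text{Semantic Values} \\
x \in \mathit{Exn} & = & \langle \mathit{var}, \mathit{SemVal}^* \rangle & \text{Exceptions}
\end{array} \]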

Environments are unique maps from identifiers to values. An environment
is applied to an identifier to give the corresponding entry in the map. For
example, if E is the environment $\{ \mathit{var} \mapsto v \}$, then E(var) = v. The operation
$m_1 \oplus m_2$ updates an environment $m_1$ with the new mappings in $m_2$. The
$m_1 \vec{\oplus} m_2$ operation is similar, but allows values in $m_1$ to be "shadowed" by
those in $m_2$; for example, $\{ \mathit{var} \mapsto v \} \vec{\oplus} \{ \mathit{var} \mapsto v' \}$ is $\{ \mathit{var} \mapsto v' \}$, whereas
$\{ \mathit{var} \mapsto v \} \oplus \{ \mathit{var} \mapsto v' \}$ is an error. Where an environment comprises a number of
sub-environments, we use the notation E of E' to select sub-environment E from E'.
Similarly, $E \oplus_{VE} E'$ replaces the VE subenvironment of E with E' etc.
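The two update operations on environments can be modelled with association lists. This is a sketch only: the names Env, apply, extend and override are hypothetical, and reporting a clash through Either is a choice made here for illustration.

```haskell
-- An environment maps identifiers to values.
type Env v = [(String, v)]

-- Applying an environment to an identifier.
apply :: Env v -> String -> Maybe v
apply env k = lookup k env

-- Strict update: binding an identifier that is already bound is an error.
extend :: Env v -> Env v -> Either String (Env v)
extend m1 m2 = case [k | (k, _) <- m2, k `elem` map fst m1] of
  []      -> Right (m1 ++ m2)
  (k : _) -> Left ("identifier bound twice: " ++ k)

-- Shadowing update: bindings in the second environment replace the first.
override :: Env v -> Env v -> Env v
override m1 m2 = m2 ++ [b | b@(k, _) <- m1, k `notElem` map fst m2]
```

With these definitions, overriding a binding of var to v with a binding of var to v' yields only the new binding, while extend reports an error for the same inputs.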

We define a number of different environments for use in the semantics.
Variable environments (VarEnv) map identifiers to semantic values. System
environments (SysEnv) map stream identifiers to the values that appear on the
associated stream, plus a boolean indicating the availability of the value. Cost
environments (Section 3) are also used in the dynamic semantics.






Kevin Hammond 



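\[ \begin{array}{rcll}
E \in \mathit{Env} & = & \langle \mathit{VarEnv}, \mathit{CostEnv}, \mathit{SysEnv} \rangle & \text{Environments} \\
IE, VE \in \mathit{VarEnv} & = & \{ \mathit{var} \mapsto (\mathit{SemVal} + \mathit{matches}) \} & \text{Value Environments} \\
SE \in \mathit{SysEnv} & = & \{ \mathit{var} \mapsto \langle \mathit{bool}, \mathit{SemVal} \rangle^* \} & \text{System Environment} \\
\mathit{bool} & = & \{ \mathit{true}, \mathit{false} \} & \text{Booleans}
\end{array} \]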

Finally, a number of special values are used in the coordination semantics.
Wiring environments (Wire) map box identifiers to their inputs and outputs.
Process sets (Processes) are sets of processes. I, A, and P are used in the
semantics to distinguish inactive, active and general process sets, and a single
process (Proc) comprises a box identifier, two tuples of identifiers representing
its inputs and outputs, and an expression representing its body.

\[ \begin{array}{rcll}
W \in \mathit{Wire} & = & \{ \mathit{var} \mapsto \langle \mathit{var}^*, \mathit{var}^* \rangle \} & \text{Wires} \\
I, A, P \in \mathit{Process} & = & \{ \mathit{Proc} \} & \text{Processes} \\
\mathit{Proc} & = & \langle \mathit{var}, \mathit{var}^*, \mathit{var}^*, \mathit{exp} \rangle & \text{Process}
\end{array} \]

We use three forms of rule. Values are determined by rules of the form
$E \vdash \mathit{exp} \Rightarrow v$, which means that given the assumptions in environment E, the value
v can be determined for syntactic value exp. Costs are determined by rules of
the form $E \vdash \mathit{exp} \stackrel{cost}{\Rightarrow} c$ (Section 3). Finally, results of pattern matches are
determined by rules of the form $E, v \models \mathit{match} \Rightarrow v'$: under the assumptions in
E, the result of matching v against the matches in match is v'.

We use a sequent style for the rules whereby the consequence (below the line) 
holds wherever the premises hold, perhaps under some side-conditions. The rules 
are defined structurally for each syntactic case. Some syntactic forms may match 
more than one rule. In these cases, the premises and side-conditions for the two 
rules must be disjoint. 



4.1 Dynamic Semantics: Declarations 

Declaration sequences (Figure 8) are processed to generate an environment that
contains a value environment and a cost environment, plus the cost of evaluating
the declaration sequence. Only two forms of declaration are interesting. Variable
declarations are evaluated at the start of a declaration block. They therefore
have a fixed dynamic cost which is attributed to the declaration sequence, a
constant value, and no entry in the cost environment (which is used to cost each
dynamic function invocation). The cost of a function definition (var matches) is
defined in Figure 6. In the simple cost semantics given here, recursive definitions
are assigned infinite cost (a valid, though highly imprecise upper bound). We
anticipate producing more refined cost functions for recursive definitions in the
future.

Box and wire declarations are processed to give a set of processes and a
wiring environment as shown in Figure 9. The set of processes and the wiring
environment are used to define the semantics of Hume programs. The wiring
environment maps the outputs of boxes or streams to the inputs of other boxes
or streams. Every box input must be connected to some input stream or box
output, while box outputs must be connected to some output stream or box input
(in time, we intend to relax this restriction so that box outputs may be connected
to multiple inputs). These restrictions are handled in the static semantics.

\[ \boxed{E \vdash \mathit{decls} \Rightarrow E, \mathit{Cost}} \]

\[ \frac{\forall i.\ 1 \le i \le n,\ E \oplus E' \vdash \mathit{decl}_i \Rightarrow VE_i, CE_i, c_i \qquad E' = \langle \bigoplus_{i=1}^{n} VE_i,\ \bigoplus_{i=1}^{n} CE_i,\ \{\} \rangle}
        {E \vdash \mathit{decl}_1 ; \ldots ; \mathit{decl}_n \Rightarrow E', \sum_{i=1}^{n} c_i} \]

\[ \boxed{E \vdash \mathit{decl} \Rightarrow VE, CE, \mathit{Cost}} \]

\[ \frac{E \vdash \mathit{exp} \Rightarrow v \qquad E \vdash \mathit{exp} \stackrel{cost}{\Rightarrow} c}
        {E \vdash \mathit{var} = \mathit{exp} \Rightarrow \{ \mathit{var} \mapsto v \}, \{\}, c} \]

\[ \frac{E \vec{\oplus}_{CE} \{ \mathit{var} \mapsto \infty \} \vdash \mathit{matches} \stackrel{cost}{\Rightarrow} c}
        {E \vdash \mathit{var}\ \mathit{matches} \Rightarrow \{ \mathit{var} \mapsto \mathit{matches} \}, \{ \mathit{var} \mapsto c \}, 0} \]

Fig. 8. Dynamic semantics for declarations



h box P 



h box boxid in ins out outs match body => { ( boxid, ins, outs, body ) } 



h wire ^ W 



W = { boxid !->■ ( sources, dests ) } 
h wire boxid sources dests W 



Fig. 9. Dynamic semantics for box and wire declarations 
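
The requirement that every box input and output is wired can be illustrated with a small check. This is only a sketch under assumptions: the dictionary encodings of boxes and of the wiring environment, and the function name `wiring_errors`, are hypothetical and are not part of Hume or of the static semantics referred to above.

```python
from typing import Dict, List, Tuple

# Hypothetical encodings (assumptions for illustration only):
#   a box is (ins, outs); a wiring entry is (sources, dests), as in W above.
Box = Tuple[List[str], List[str]]
Wiring = Dict[str, Tuple[List[str], List[str]]]


def wiring_errors(boxes: Dict[str, Box], wiring: Wiring) -> List[str]:
    """List every box whose inputs or outputs are not all connected."""
    errors = []
    for boxid, (ins, outs) in boxes.items():
        sources, dests = wiring.get(boxid, ([], []))
        if len(sources) != len(ins):
            errors.append(f"{boxid}: {len(ins)} inputs but {len(sources)} sources")
        if len(dests) != len(outs):
            errors.append(f"{boxid}: {len(outs)} outputs but {len(dests)} destinations")
    return errors


boxes = {"inc": (["n"], ["n1"])}
good = {"inc": (["stdin"], ["stdout"])}
bad = {"inc": ([], ["stdout"])}
print(wiring_errors(boxes, good))
print(wiring_errors(boxes, bad))
```

A fuller check would also confirm that each named source and destination actually exists, which the sketch leaves out.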



4.2 Dynamic Semantics: Expressions 

The dynamic semantics of expressions (Figure 10) is analogous to that for Standard
ML. The only interesting rules are those for time or space constraints. If the
cost of evaluating an expression (given by the cost rules defined earlier)
is greater than the specified timeout in a within-expression, then the Timeout 
exception is raised, otherwise the value of the within-expression is the same as 
the encapsulated expression. Similar rules apply to space constraints. 
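
The two cases for a within-expression can be sketched as follows. The sketch assumes that the statically derived cost and the time limit are already available as integers, and it represents the Timeout exception as an ordinary value, as the semantics does; the function name `within` and its parameters are illustrative only.

```python
from typing import Any, Callable

TIMEOUT = ("Timeout", ())  # the exception value (Timeout, ())


def within(static_cost: int, limit: int, evaluate: Callable[[], Any]) -> Any:
    """Value of `exp1 within exp2`, given the cost t' of exp1 and the limit t."""
    if static_cost > limit:
        # t' > t: the Timeout exception is the result; exp1 is not evaluated.
        return TIMEOUT
    # t' <= t: the result is the value of the encapsulated expression.
    return evaluate()


print(within(3, 5, lambda: 42))
print(within(7, 5, lambda: 42))
```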

Judgement form: $E \vdash exp \Rightarrow v$

\[
\frac{E \vdash exp_2 \Rightarrow t \qquad E \stackrel{cost}{\vdash} exp_1 \Rightarrow t' \qquad t' \le t
      \qquad E \vdash exp_1 \Rightarrow v}
     {E \vdash exp_1\ \mathbf{within}\ exp_2 \Rightarrow v}
\]

\[
\frac{E \vdash exp_2 \Rightarrow t \qquad E \stackrel{cost}{\vdash} exp_1 \Rightarrow t' \qquad t' > t}
     {E \vdash exp_1\ \mathbf{within}\ exp_2 \Rightarrow ( Timeout, ( ) )}
\]

Fig. 10. Dynamic semantics for within-expressions



Exceptions are matched against each handler in a handler sequence as shown
in Figure 11. Each exception that can be raised must be handled by precisely one
handler in the sequence. The value contained within the exception is matched
against the pattern in the handler using the normal pattern matching rules (also
Figure 11). A syntactic pattern match sequence is matched against a concrete
value. Each match in the sequence is tried in turn. If a match succeeds then the
result of the match is used as the value of the sequence. If a match fails, then
the next match is tried. Precisely one match must succeed.

The individual pattern match rules given here consider only simple, unnested 
patterns which are either variables or constructors. It is straightforward to extend 
the rules to cover nested patterns or multiple patterns within a match (as for a
function of multiple arguments) [10]. A variable pattern matches any value. The
result of the match is the value of the expression in the environment extended 
with the binding of the variable to the value. A matching constructor binds each 
of its formal arguments to the corresponding actual parameter value. 
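
The matching rules for simple, unnested patterns can be sketched as below. The encodings are assumptions made for illustration: a constructor value is a pair of a constructor name and an argument list, a pattern is a tagged tuple, and the right-hand side of a match is a Python function from an environment to a value. The sketch returns the first match that succeeds, relying on the requirement that precisely one match succeeds rather than checking it.

```python
FAIL = object()  # sentinel standing for a failed match


def match_one(env, value, match):
    """Try one match; pattern is ('var', name) or ('con', name, [vars])."""
    pattern, body = match
    if pattern[0] == "var":
        # A variable pattern matches any value and binds the variable to it.
        return body({**env, pattern[1]: value})
    _, con, names = pattern
    if (isinstance(value, tuple) and len(value) == 2 and value[0] == con
            and len(value[1]) == len(names)):
        # A matching constructor binds each formal argument to the actual one.
        return body({**env, **dict(zip(names, value[1]))})
    return FAIL


def match_seq(env, value, matches):
    """Try each match in turn; the first success gives the result."""
    for m in matches:
        result = match_one(env, value, m)
        if result is not FAIL:
            return result
    raise ValueError("no match succeeded")


matches = [
    (("con", "Nil", []), lambda e: 0),
    (("con", "Cons", ["h", "t"]), lambda e: e["h"]),
    (("var", "x"), lambda e: e["x"]),
]
print(match_seq({}, ("Cons", [7, ("Nil", [])]), matches))
```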

4.3 Dynamic Semantics: Boxes 

The dynamic semantics of a Hume program is given by repeatedly reducing 
each of its boxes in the context of the declarations and wirings. The result of 
a Hume program is a new environment reflecting the state of any new bindings 
in the system or value environments. Figure 12 shows this semantics. The set
of processes defined in the program is split into two subsets: one of inactive 
processes, the other of active processes. These sets are reduced one step to give 
a new environment and new sets of inactive and active processes. This is repeated 
until the set of active processes becomes empty, at which point the result of the 
program is the current environment. 

The set of processes is split into active (A) and inactive processes (I). A 
process is active if input is available on all its input channels, or if a timeout 
has been raised on any input channel. The auxiliary function active is used to 
determine whether a box should be active or not based on the current values on 
its inputs. It can be defined as follows: 

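\[
\begin{array}{lcl}
active\ (E, (win_1 \ldots win_n)) & = & ready\ (E, (win_1 \ldots win_n)) \\
                                  &   & \vee\ timedout\ (E, (win_1 \ldots win_n)) \\[4pt]
ready\ (E, (win_1 \ldots win_n))  & = & fst\ (E(win_1)) = true\ \wedge \\
                                  &   & \ldots\ \wedge\ fst\ (E(win_n)) = true \\[4pt]
timedout\ (E, (win_1 \ldots win_n)) & = & fst(snd\ (E(win_1))) = ( Timeout, ( ) ) \\
                                  &   & \vee\ timedout\ (E, (win_2 \ldots win_n))
\end{array}
\]

Judgement form: $E, v \models handlers \Rightarrow v$

\[
\frac{E, v \models handler \Rightarrow v'}
     {E, v \models handler \mid handlers \Rightarrow v'}
\qquad
\frac{E, v \models handler \Rightarrow FAIL \qquad E, v \models handlers \Rightarrow v'}
     {E, v \models handler \mid handlers \Rightarrow v'}
\]

Judgement form: $E, v \models handler \Rightarrow v / FAIL$

\[
\frac{v = ( exnid', v' ) \qquad E, v' \models pat \rightarrow exp \Rightarrow v'' \qquad exnid = exnid'}
     {E, v \models exnid\ pat \rightarrow exp \Rightarrow v''}
\]

\[
\frac{v = ( exnid', v' ) \qquad exnid \ne exnid'}
     {E, v \models exnid\ pat \rightarrow exp \Rightarrow FAIL}
\]

Judgement form: $E, v \models matches \Rightarrow v$

\[
\frac{E, v \models match \Rightarrow v'}
     {E, v \models match \mid matches \Rightarrow v'}
\qquad
\frac{E, v \models match \Rightarrow FAIL \qquad E, v \models matches \Rightarrow v'}
     {E, v \models match \mid matches \Rightarrow v'}
\]

Judgement form: $E, v \models match \Rightarrow v / FAIL$

\[
\frac{E \oplus \{ var \mapsto v \} \vdash exp \Rightarrow v'}
     {E, v \models var \rightarrow exp \Rightarrow v'}
\]

\[
\frac{v = con\ ( v_1, \ldots, v_n ) \qquad
      E \oplus \{ \forall i.\ 1 \le i \le n,\ var_i \mapsto v_i \} \vdash exp \Rightarrow v'}
     {E, v \models con\ var_1 \ldots var_n \rightarrow exp \Rightarrow v'}
\]

\[
\frac{v \ne con\ ( v_1, \ldots, v_n )}
     {E, v \models con\ var_1 \ldots var_n \rightarrow exp \Rightarrow FAIL}
\]

Fig. 11. Dynamic semantics for exception handlers and pattern matches



Judgement form: $E, W \vdash P \Rightarrow E$

\[
\frac{E, W \vdash P \Rightarrow I, A \qquad E, W \vdash I, A \Rightarrow E'}
     {E, W \vdash P \Rightarrow E'}
\]

Judgement form: $E, W \vdash P, P \Rightarrow E$

\[
\frac{E, W \vdash I, A \Rightarrow E', I', A' \qquad E', W \vdash I', A' \Rightarrow E'' \qquad A \ne \{\,\}}
     {E, W \vdash I, A \Rightarrow E'', I', A'}
\qquad
\frac{}{E, W \vdash I, \{\,\} \Rightarrow E}
\]

Judgement form: $E, W \vdash P \Rightarrow P, P$

\[
\frac{\forall i.\ 1 \le i \le n,\quad E, W \vdash P_i \Rightarrow I_i, A_i
      \qquad I = \bigcup_{i=1}^{n} I_i \qquad A = \bigcup_{i=1}^{n} A_i}
     {E, W \vdash \{ P_1, \ldots, P_n \} \Rightarrow I, A}
\]

Judgement form: $E, W \vdash Proc \Rightarrow P, P$

\[
\frac{\begin{array}{c}
      P = ( boxid, ins, outs, body ) \qquad W(boxid) = ( wins, wouts ) \\
      I, A = \mathbf{if}\ active(E, wins)\ \mathbf{then}\ \{\,\}, \{ P \}\ \mathbf{else}\ \{ P \}, \{\,\}
      \end{array}}
     {E, W \vdash P \Rightarrow I, A}
\]

Fig. 12. Dynamic semantics for processes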

Having determined the set of currently active processes, each of these is
executed for one step as shown in Figure 13. A process is executed by determining
the value of each of its inputs, and then executing the body of the process in 
the context of those values. The new values of the inputs and outputs are used 
to update the stream and box input bindings in the system environment as 
appropriate. 
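
The activity test and the split into inactive and active processes can be sketched as below. The sketch makes a simplifying assumption that differs from the formal definitions: each wire is represented as a single pair `(available, value)` rather than the nested pairs used in the figures, and processes are tuples whose first component is the box identifier. The function names mirror the auxiliary functions in the text; the helper `split` is hypothetical.

```python
TIMEOUT = ("Timeout", ())


def ready(env, wins):
    """Input is available on every input wire."""
    return all(env[w][0] for w in wins)


def timedout(env, wins):
    """A timeout has been raised on at least one input wire."""
    return any(env[w][1] == TIMEOUT for w in wins)


def active(env, wins):
    return ready(env, wins) or timedout(env, wins)


def split(env, wiring, procs):
    """Partition processes into (inactive, active) sets."""
    inactive, act = [], []
    for proc in procs:
        wins, _wouts = wiring[proc[0]]
        (act if active(env, wins) else inactive).append(proc)
    return inactive, act


env = {"a": (True, 1), "b": (False, None), "c": (False, TIMEOUT)}
wiring = {"p": (["a"], []), "q": (["a", "b"], []), "r": (["b", "c"], [])}
procs = [("p",), ("q",), ("r",)]
print(split(env, wiring, procs))
```

Here `q` stays inactive because one of its inputs is unavailable, while `r` becomes active only because a timeout is pending on one of its inputs.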

The final set of coordination rules (Figure 14) defines the semantics of executing
a single box body. There are three cases, corresponding to normal execution,
an exception or a timeout respectively.

Judgement form: $E, W \vdash P, P \Rightarrow E, P, P$

\[
\frac{\begin{array}{c}
      \forall i.\ 1 \le i \le card(A),\quad E, W \vdash A_i \Rightarrow outs_i, E^I_i, E^O_i \\
      E' = \bigcup_{i=1}^{card(A)} E^I_i\ \oplus\ \bigcup_{i=1}^{card(A)} E^O_i \\
      E \oplus E', W \vdash I \cup A \Rightarrow I', A'
      \end{array}}
     {E, W \vdash I, A \Rightarrow (E \oplus E'), I', A'}
\]

Judgement form: $E \vdash P \Rightarrow v, E, E$

\[
\frac{\begin{array}{c}
      W(boxid) = ( wins, wouts ) \qquad n = card(wins) \qquad SE = SE\ \mbox{of}\ E \\
      vs = ( snd(fst(SE(wins_1))), \ldots, snd(fst(SE(wins_n))) ) \\
      E, vs \vdash body \Rightarrow vs' \\
      SE^I = \{ \forall i.\ 1 \le i \le n,\ wins_i \mapsto snd(SE\ wins_i) \} \\
      SE^O = \{ \forall i.\ 1 \le i \le card(wouts),\ wouts_i \mapsto ( (true, vs'_i), (false, ()) ) \}
      \end{array}}
     {E, W \vdash ( boxid, ins, outs, body ) \Rightarrow vs', SE^I, SE^O}
\]

Fig. 13. Dynamic semantics for active processes
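
The three cases for executing a box body (normal execution, an exception, or a timeout) can be sketched as follows. All names here are assumptions for illustration: the body's matches are modelled as a Python function over the input values, handlers as a list of `(exnid, function)` pairs tried in order, and the static cost and time limit as integers supplied by the caller.

```python
TIMEOUT = ("Timeout", ())


def is_exn(v, exn_ids):
    """An exception value is a pair (exnid, payload) with a known exnid."""
    return isinstance(v, tuple) and len(v) == 2 and v[0] in exn_ids


def handle(handlers, exn):
    """Try each handler in turn; the matching one receives the payload."""
    exnid, payload = exn
    for hid, fn in handlers:
        if hid == exnid:
            return fn(payload)
    raise ValueError(f"unhandled exception {exnid}")


def run_body(cost, limit, matches, handlers, inputs, exn_ids):
    if cost > limit:
        # Timeout: the matches are not run, the handlers see (Timeout, ()).
        return handle(handlers, TIMEOUT)
    v = matches(inputs)
    if is_exn(v, exn_ids):
        # Exception: the raised value is dispatched to the handlers.
        return handle(handlers, v)
    return v  # normal execution


handlers = [("Timeout", lambda _: "late"), ("Div", lambda p: ("bad", p))]
exn_ids = {"Timeout", "Div"}
body = lambda vs: ("Div", vs[0]) if vs[1] == 0 else vs[0] // vs[1]
print(run_body(1, 2, body, handlers, (6, 3), exn_ids))
```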



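Judgement form: $E, v \vdash body \Rightarrow v$

\[
\frac{\begin{array}{c}
      E \vdash time \Rightarrow t \qquad E \stackrel{cost}{\vdash} matches \Rightarrow t' \qquad t' \le t \\
      E, vs \vdash matches \Rightarrow v \qquad v \notin Exn
      \end{array}}
     {E, vs \vdash matches\ \mathbf{timeout}\ time\ \mathbf{handle}\ handlers \Rightarrow v}
\]

\[
\frac{\begin{array}{c}
      E \vdash time \Rightarrow t \qquad E \stackrel{cost}{\vdash} matches \Rightarrow t' \qquad t' \le t \\
      E, vs \vdash matches \Rightarrow v \qquad E, v \models handlers \Rightarrow v' \qquad v \in Exn
      \end{array}}
     {E, vs \vdash matches\ \mathbf{timeout}\ time\ \mathbf{handle}\ handlers \Rightarrow v'}
\]

\[
\frac{\begin{array}{c}
      E \vdash time \Rightarrow t \qquad E \stackrel{cost}{\vdash} matches \Rightarrow t' \qquad t' > t \\
      E, ( Timeout, ( ) ) \models handlers \Rightarrow v
      \end{array}}
     {E, vs \vdash matches\ \mathbf{timeout}\ time\ \mathbf{handle}\ handlers \Rightarrow v}
\]

Fig. 14. Dynamic semantics for box bodies



5 Related Work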

Specification Languages. Safety-critical systems have strong time-based correctness
requirements, which can be expressed formally as properties of safety, liveness
and timeliness [1]. Formal requirements specifications are expressed using
notations such as temporal logics (e.g. MTL), non-temporal logics (e.g. RTL), 
or timed process algebras (e.g. LOTOS-T, Timed CCS or Timed CSP). Such 
notations are deliberately non-deterministic in order to allow alternative imple- 
mentations, and may similarly leave some or all timing issues unspecified. It is 
essential to crystallise these factors amongst others when producing a working 
implementation. 

Non-Determinism. Although non-determinism may be required in specification 
languages such as LOTOS, it is usually undesirable in implementation languages 
such as Hume, where predictable and repeatable behaviour is required jT]. Hume 
thus incorporates deterministic processes, but with the option of fair choice to 
allow the definition of alternative acceptable outcomes. Because of the emphasis 
on hard real-time, it is not possible to use the event synchronising approach 
based on delayed timestamps which has been adopted by e.g. the concurrent 
functional language BRISK [5]. The advantage of the BRISK approach is in 
ensuring strong determinism without requiring explicit specifications of time 
constraints as in Hume. 

Synchronicity. Synchronous languages such as Signal, Lustre, Esterel or the 
visual formalism Statecharts obey the synchrony hypothesis: they assume that 
all events occur instantaneously, with no passage of time between the occurrence 
of consecutive events. In contrast, asynchronous languages, such as the extended 
finite state machine languages Estelle and SDL, make no such assumption. Hume 
uses an asynchronous approach, for reasons of both expressiveness and realism. 
Like Estelle and SDL, it also employs an asynchronous model of communication 
and supports asynchronous execution of concurrent processes. 

Summary Comparison. As a vehicle for implementing safety-critical or hard real- 
time problems, Hume thus has advantages over widely-used existing language 
designs. Compared with Estelle or SDL, for example, it is formally defined, 
deterministic, and provably bounded in both space and time. These factors lead 
to a better match with formal requirements specifications and enhance confidence 
in the correctness of Hume programs. Hume has the advantage over Lustre and 
Esterel of providing asynchronicity, which is required for distributed systems. 
Finally, it has the advantage over LOTOS or other process algebras of being
designed as an implementation rather than specification language: inter alia
it supports normal program and data structuring constructs, allowing a rich
programming environment.

5.1 Bounded Time/Space Models

Other than our own work, we are aware of three main studies of formally bounded
time and space behaviour in a functional setting [2,7,20].

Embedded ML. In their recent proposal for Embedded ML, Hughes and Pareto [7]
have combined the earlier sized type system with the notion of region types [19]
to give bounded space and termination for a first-order strict functional
language. Their language is more restricted than Hume in a number of ways: most
notably in not supporting higher-order functions, and in requiring
programmer-specified memory usage.

Inductive Cases. Burstall [2] proposed the use of an extended ind case notation
in a functional context, to define inductive cases from inductively defined data 
types. Here, notation is introduced to constrain recursion to always act on a 
component of the “argument” to the ind case i.e. a component of the data type 
pattern on which a match is made. While ind case enables static confirmation of 
termination, Burstall’s examples suggest that considerable ingenuity is required 
to recast terminating functions based on a laxer syntax. 

Elementary Strong Functional Programming. Turner’s elementary strong functional
programming [20] has similarly explored issues of guaranteed termination
in a purely functional programming language. Turner’s approach separates fi- 
nite data structures such as tuples from potentially infinite structures such as 
streams. This allows the definition of functions that are guaranteed to be primi- 
tive recursive. In contrast with the Hume expression layer, it is necessary to iden- 
tify functions that may be more generally recursive. We will draw on Turner’s 
experiences in developing our own termination analysis. 

Other work on Bounded Time/Space. Also relevant to the problem of bounding 
time costs is recent work on cost calculi mm and cost modelling US], which has 
so far been primarily applied to parallel computing. In a slightly different context, 
computer hardware description languages, such as Hydra PS], also necessarily 
provide hard limits on time and space cost bounds, though in a framework 
that is less computationally based than Hume (the use of timeout exceptions to 
limit cost for potentially unbounded computation, and the provision of highly 
general forms of computation seem best suited to software rather than hardware 
implementation, for example). An especially interesting and relevant idea from 
the latter community is the use of polymorphic templates for communication 
channels (Hume wires), which can be specialised to monomorphic, bounded-size 
instances for implementation. A detailed comparison of language designs for 
hardware description with those for safety-critical systems should thus reveal 
interesting commonalities.

6 Conclusion

This paper has introduced the novel programming language Hume using a formal 
descriptive semantics. An unusual feature of this semantics is that it is integrated 
with a simple static time analysis for first-order programs. The semantics given 
here covers all essential features of the language including programs, the full co- 
ordination layer, unusual expression language features, exception handling and 
pattern-matching. Since the semantics is primarily intended to act as a formal 
description of Hume rather than a simple basis for a proof mechanism, we have 
not attempted to prove interesting properties of Hume programs. It should,
however, be possible to use this semantics to prove properties of the language such
as guaranteed bounded time and space behaviour, and we have given a simple 
example of the use of our cost calculus to determine static bounds on program 
execution cost. In the longer term, we hope to be able to extend this work to 
deal with higher-order functions using techniques developed for costing parallel 
programs [9], and to provide better bounds for recursive function definitions.
Part of this work will involve constructing analyses to determine properties of
primitive/nested recursion.



Abstract. Usage analysis aims to predict the number of times a heap
allocated closure is used. Previously proposed usage analyses have proved 
not to scale up well to large programs. In this paper we present a powerful 
and accurate type based analysis designed to scale up for large programs. 
The key features of the type system are usage subtyping and bounded 
usage polymorphism. Bounded polymorphism can lead to huge constraint 
sets so to express constraints compactly we introduce a new expressive 
form of constraints which allows constraints to be represented compactly 
through calls to constraint abstractions. 



1 Introduction 

In the implementation of a lazy functional language sharing of evaluation is 
performed by updating. For example, the (unoptimised) evaluation of 

(λx. x + x) (1 + 2)

proceeds as follows. First, a closure for 1 + 2 is built in the heap and a reference 
to the closure is passed to the abstraction. Second, to evaluate x + x the value 
of x is required. Thus the closure is fetched from the heap and evaluated. Third,
the closure is updated with the result so that when the value of x is required
again the expression need not be recomputed.
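
The effect of updating can be illustrated with a small sketch. The `Closure` class and the call log below are hypothetical illustrations of the mechanism, not the representation used by any particular implementation.

```python
class Closure:
    """A heap-allocated suspension, overwritten with its value on first use."""

    def __init__(self, compute):
        self.compute = compute
        self.value = None
        self.evaluated = False

    def force(self):
        if not self.evaluated:
            self.value = self.compute()  # evaluate the suspended expression
            self.evaluated = True        # the update: remember the result
            self.compute = None
        return self.value


calls = []  # records each time the suspended expression is actually computed


def one_plus_two():
    calls.append("evaluated")
    return 1 + 2


x = Closure(one_plus_two)        # a closure for 1 + 2 is built in the heap
result = x.force() + x.force()   # the body x + x uses the closure twice
print(result, len(calls))
```

Because of the update the suspended expression is computed once even though it is used twice. For a closure used at most once, the bookkeeping done in `force` is wasted work, which is what usage analysis tries to detect.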

Measurements by Marlow show that 70% of all closures are used at most once 
and that it is therefore unnecessary to update them. Usage information also en- 
ables a series of program transformations such as more aggressive inlining and 
let-floating jJ"WMfl5fWP.Iflfll(IS00| . It is therefore no surprise that considerable 
effort has been put into static analyses that can discover if a closure is used at 
most once |Ses!I1 IT.CH+fl2IMa,r98ITWM95IFa,x95IR.T9BjMog97fCus98IWPJ^ . 
This line of research has produced analyses with increasing accuracy, and bench- 
marks have shown that for small programs they discover a large portion of clo- 
sures used at most once. However these analyses are monovariant and do not 
take the context where a function is called into account. When analysing large 
programs it is crucial to take the context into account - when Wansbrough and 
Peyton Jones implemented the recent analysis from [WPJ99] in the Glasgow
Haskell Compiler they discovered that it was almost useless in practice since it
did not scale up for large programs [WPJ00].

In this paper we present a powerful and accurate type system which attempts
to solve this problem. It takes the context where a function is called into account 
through bounded usage polymorphism. We designed our type system by putting 
together and extending the best ideas from previous work. The salient features 
of the type system are these: 

— Our system has full-blown bounded usage polymorphism and supports usage
polymorphic recursion.

— In [WPJ98] Wansbrough and Peyton Jones give an overview of the design
space for how to treat data structures. We choose the most aggressive
approach, which corresponds to the hard-wired treatment of lists in [TWM95].

— Our system is based on subsumption between usage types. The use of
subtyping in usage analysis goes back to Faxén [Fax95].

— We have a three-level type language which incorporates separate notions of
usage of closures and usage of values, which gives increased precision. To
separate the usage of closures and values is an idea due to Faxén [Fax95].

— We have expressive update annotations which allow us to express more ag- 
gressive optimisations than previous analyses. 

Having all these features is not very useful unless there is an efficient inference 
algorithm for the type system. Here bounded polymorphism presents a problem. 
See for example Mossin’s thesis [Mos97] for an account of the problems with
bounded flow polymorphism in type based flow analyses. The core of the problem
is that the quantified variables in a type schema may be constrained by a huge
number of constraints. In the naive inference algorithm first presented by Mossin
the number of constraints may be exponential in the size of the program. Mossin
refines the algorithm by adding a constraint simplification phase which yields
an inference algorithm which is $O(n^7)$.

A novelty in our work is a new expressive form of constraints which allows 
constraints to be represented compactly through calls to constraint abstractions. 
To efficiently compute least solutions to constraints with constraint abstractions 
is an involved problem and is the subject of a companion paper [GS01]. There
we show how to efficiently compute a least solution to constraints in a constraint 
language with constraint abstractions and inequality constraints over a lattice. 
Using these techniques we can obtain an inference algorithm for our usage analysis
which is $O(n^3)$ where $n$ is the size of the explicitly typed program. We believe
that constraint abstractions can be very useful for a range of program analyses 
which feature bounded annotation polymorphism, and in [GS01] we show how to
apply the ideas to a flow analysis with bounded flow polymorphism. Other
candidates may be effect analysis, e.g., [TJ94], binding time analysis, e.g., [DHM95],
non-determinism analysis, e.g., [PS00], and uniqueness type systems, e.g., [BS96].

1.1 Outline 

This paper is organised as follows. Section 2 introduces the language and its
semantics. Section 3 presents the type system. Later sections describe related
work and conclude.

2 Language

In this section we will present our language and its semantics in the form of an 
abstract machine. 



2.1 Syntax 



The language we use is a lambda calculus extended with integers, lists, case- 
expressions and recursive let-expressions. We omit user defined data structures 
to simplify the presentation but it is a straightforward matter to add them 

jMnj . 



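\[
\begin{array}{llcl}
\mbox{Variables}    & x, y, z \\
\mbox{Values}       & v      & ::= & \lambda x.e \mid n \mid \mathsf{nil} \mid \mathsf{cons}\ x\ y \\
\mbox{Expressions}  & e      & ::= & v^{\kappa} \mid x \mid e\ x \mid e_0 +^{\kappa} e_1
                                     \mid \mathsf{let}\ b_1, \ldots, b_n\ \mathsf{in}\ e
                                     \mid \mathsf{case}\ e\ \mathsf{of}\ alts \\
\mbox{Bindings}     & b      & ::= & x =^{\kappa} e \\
\mbox{Alternatives} & alts   & ::= & \{ \mathsf{nil} \Rightarrow e_0 ;\ \mathsf{cons}\ x\ y \Rightarrow e_1 \} \\
\mbox{Annotations}  & \kappa & ::= & 1 \mid \omega
\end{array}
\]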



We annotate bindings, values and + with usage annotations 1 and ω, ranged over
by κ. The intuitive meaning of 1 and ω is that the annotated binding (or value)
may be used at most once and any number of times respectively.

A distinguishing feature of the syntax is that arguments (in applications of
terms and constructors) are restricted to variables. We will occasionally use
unrestricted application $e_0\ e_1$ as syntactic sugar for
$\mathsf{let}\ x =^{\omega} e_1\ \mathsf{in}\ e_0\ x$ where $x$ is a
fresh variable. The purpose of the restricted syntax is to make the creation of
closures explicit via a let-expression, which greatly simplifies the presentation of the
abstract machine as well as the analysis presented in this paper. The syntactic
restriction is by now rather standard, see for example [PJPS96, Lau93, Ses97, GS99].



2.2 Semantics 

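We will take Sestoft’s abstract machine [Ses97] as the semantic basis of our work.
The machine can be thought of as modelling lower-level abstract machines based
on so called update markers, such as the TIM [FW87] and the STG-machine
[PJ92]. A correspondence between Sestoft’s machine and Launchbury’s natural
semantics for lazy evaluation [Lau93] has been shown in [Ses97]. For the purpose
of the abstract machine we extend the set of terms to include expressions of
the form $\mathsf{add}_n^{\kappa}\ e$, which represents an intermediate step in the computation of
$n^{\kappa'} +^{\kappa} e$. We define a reduction relation $e \mapsto e'$ between terms:

\[
\begin{array}{rcl}
(\lambda x.e)^{\kappa}\ y & \mapsto & e[x := y] \\
n^{\kappa'} +^{\kappa} e & \mapsto & \mathsf{add}_n^{\kappa}\ e \\
\mathsf{add}_{n_0}^{\kappa}\ n_1^{\kappa'} & \mapsto & (n_0 + n_1)^{\kappa} \\
\mathsf{case}\ \mathsf{nil}^{\kappa}\ \mathsf{of}\ \{ \mathsf{nil} \Rightarrow e_0 ;\ \mathsf{cons}\ x'\ y' \Rightarrow e_1 \} & \mapsto & e_0 \\
\mathsf{case}\ (\mathsf{cons}\ x\ y)^{\kappa}\ \mathsf{of}\ \{ \mathsf{nil} \Rightarrow e_0 ;\ \mathsf{cons}\ x'\ y' \Rightarrow e_1 \} & \mapsto & e_1[x' := x, y' := y]
\end{array}
\]

\[
\begin{array}{rcll}
\langle H ;\ \mathsf{let}\ b\ \mathsf{in}\ e ;\ S \rangle & \longrightarrow_{Let} & \langle H, b ;\ e ;\ S \rangle \\
\langle H, x =^{\omega} e ;\ x ;\ S \rangle & \longrightarrow_{Var\mbox{-}\omega} & \langle H ;\ e ;\ \#x, S \rangle \\
\langle H, x =^{1} e ;\ x ;\ S \rangle & \longrightarrow_{Var\mbox{-}1} & \langle H ;\ e ;\ S \rangle \\
\langle H ;\ R[e] ;\ S \rangle & \longrightarrow_{Unwind} & \langle H ;\ e ;\ R, S \rangle \\
\langle H ;\ v^{\kappa} ;\ R, S \rangle & \longrightarrow_{Reduce} & \langle H ;\ e ;\ S \rangle & \mbox{if } R[v^{\kappa}] \mapsto e \\
\langle H ;\ v^{\omega} ;\ \#x, S \rangle & \longrightarrow_{Marker\mbox{-}\omega} & \langle H, x =^{\omega} v^{\omega} ;\ v^{\omega} ;\ S \rangle \\
\langle H ;\ v^{1} ;\ \#x, S \rangle & \longrightarrow_{Marker\mbox{-}1} & \langle H ;\ v^{1} ;\ S \rangle
\end{array}
\]

Fig. 1. Abstract machine transition rules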



Note that no reduction depends on an annotation. The annotations are instead
taken into account in the abstract machine transition rules.

Configurations in the abstract machine are triples $\langle H ; e ; S \rangle$, where $H$ is
a heap, $e$ is the term currently being evaluated and $S$ is the abstract machine
stack:

\[
\begin{array}{llcl}
\mbox{Heaps}              & H & ::= & b_1, \ldots, b_n \\
\mbox{Stacks}             & S & ::= & \epsilon \mid R, S \mid \#x, S \\
\mbox{Reduction contexts} & R & ::= & [\cdot]\ x \mid [\cdot] +^{\kappa} e \mid \mathsf{add}_n^{\kappa}\ [\cdot] \mid \mathsf{case}\ [\cdot]\ \mathsf{of}\ alts
\end{array}
\]

A heap consists of a sequence of bindings. The variables bound by the heap must
be distinct and the order of bindings is irrelevant. Thus a heap can be considered
as a partial function mapping variables to terms and we will write $dom(H)$ for
the set of variables bound by $H$. We will write $H_0, H_1$ for the concatenation of
$H_0$ and $H_1$. An abstract machine stack is a stack of shallow reduction contexts
and update markers. The stack can be thought of as corresponding to the
“surrounding derivation” in a natural semantics, where the role of an update marker
$\#x$ is to keep track of a pending update of $x$. The update markers on the stack
will be distinct, that is, there will be no more than one pending update of the
same variable. We will consider an update marker as a binder and we will write
$dom(S)$ for the variables bound by the update markers in $S$. Consequently, we
will require the variables bound by the stack to be distinct from the variables
bound by the heap. We will also require that configurations are closed and we
will identify configurations up to α-conversion, that is, renaming of the variables
bound by the heap and the stack. We will also identify configurations up to
garbage, meaning that we may remove or add bindings and update markers to
the heap as long as the configuration remains closed. An initial configuration is
of the form $\langle \epsilon ; e ; \epsilon \rangle$, where $e$ is a closed expression. The transition rules of the
abstract machine are given in Figure 1. The rule Let

\[
\langle H ;\ \mathsf{let}\ b\ \mathsf{in}\ e ;\ S \rangle \longrightarrow \langle H, b ;\ e ;\ S \rangle
\]

creates new bindings in the heap. For the rule to be applied the variables bound
by $b$ must be distinct from the variables bound by $H$ and $S$. This condition can
always be met simply by α-converting the let-expression. The rule Var-ω

\[
\langle H, x =^{\omega} e ;\ x ;\ S \rangle \longrightarrow \langle H ;\ e ;\ \#x, S \rangle
\]

gives semantics to bindings annotated with ω. The rule states that an update
marker shall be pushed onto the stack so that the variable $x$ eventually may be
updated with the result of evaluating $e$. The removal of the binding corresponds
to so called black-holing: if the evaluation of $e$ to a value depends on $x$ (i.e., $x$
depends directly on itself) the computation will get stuck, since $x$ is no longer
bound by the heap. Note that we still consider the configuration to be closed,
since $x$ is bound by the update marker on the stack. The rule Var-1

\[
\langle H, x =^{1} e ;\ x ;\ S \rangle \longrightarrow \langle H ;\ e ;\ S \rangle
\]

gives semantics to bindings annotated with 1. Such bindings may only be used
once so there is no need to update the binding and thus no update marker is
pushed onto the stack. Note that we require configurations to be closed so the
rule does not apply unless the configuration remains closed. An example of where
the rule does not apply is the configuration

\[
\langle x =^{1} 1 +^{\omega} 2 ;\ x ;\ [\cdot] +^{\omega} x, \epsilon \rangle
\]

which cannot reduce further since there is a reference to $x$ on the stack. This
restriction is important since an open configuration would correspond to dangling
pointers in an implementation. If the rule does not apply the computation will
go wrong, and we will consider the configuration and the term it originates from
to be ill-annotated. The key property of the type system presented in this paper
is that if a term is well-typed then it cannot go wrong. Note that the insistence
that configurations remain closed is a stronger requirement than the intuitive
“used at most once” criterion, which says that it is safe to avoid updating a
closure if it is used at most once. For example, according to the weaker criterion
it is safe to not update $x$ in

\[
\mathsf{let}\ x = 1 + 2\ \mathsf{in}\ x + (\lambda y.3)\ x
\]

because $x$ is only used once, but according to our criterion it is not safe. Our
stronger criterion is useful for two reasons. Firstly, with dangling pointers special
care has to be taken so that the garbage collector does not follow them, and
there is a cost associated with that. Secondly, usage annotations can be used to
justify certain program transformations, such as more aggressive inlining.
Gustavsson and Sands [GS99] have shown that the stronger criterion can guarantee
that these transformations are time and space safe, but with the weaker “used
at most once” criterion the transformations can lead to an asymptotically worse
space behaviour. The rule Unwind

\[
\langle H ;\ R[e] ;\ S \rangle \longrightarrow \langle H ;\ e ;\ R, S \rangle
\]

allows us to get to the heart of the evaluation by “unwinding” a shallow reduction
context. When the term to be evaluated is a value the next transition depends
on whether an update marker or a reduction context is on top of the stack. If it
is a reduction context the rule Reduce

\[
\langle H ;\ v^{\kappa} ;\ R, S \rangle \longrightarrow \langle H ;\ e ;\ S \rangle \quad \mbox{if } R[v^{\kappa}] \mapsto e
\]

applies, the value is plugged into the reduction context and a reduction can take
place. If the top of the stack is an update marker, what happens depends on the
annotation on the value. If it is ω the value may be used several times and we
apply the rule Update-ω

\[
\langle H ;\ v^{\omega} ;\ \#x, S \rangle \longrightarrow \langle H, x =^{\omega} v^{\omega} ;\ v^{\omega} ;\ S \rangle
\]

which takes care of the update marker and performs the update. If the value on
the other hand is annotated with 1, the value may only be used once so the rule
Update-1

\[
\langle H ;\ v^{1} ;\ \#x, S \rangle \longrightarrow \langle H ;\ v^{1} ;\ S \rangle
\]

throws away the marker without performing the update. Again, note that the
rule does not apply unless the configuration remains closed. So, for example,

\[
\langle \epsilon ;\ 3^{1} ;\ \#x, [\cdot] +^{\omega} x, \epsilon \rangle
\]

goes wrong and we consider the configuration to be ill-annotated.
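
To make the role of the annotations concrete, the following is a minimal sketch of the machine restricted to integers, addition, variables and single-binding let. The tuple encoding of terms, the frame names and the function names are all assumptions made for this illustration; lambda abstraction, lists and α-conversion are left out, so every let-bound name is assumed to be distinct. Annotations are written `"1"` and `"w"`, and closedness is checked up to garbage by following only the variables reachable from the current term and the stack.

```python
# Terms: ("int", n, k), ("var", x), ("add", e0, e1, k),
#        ("addn", n, e, k) for the intermediate add_n e,
#        ("let", x, k, e, body) for let x =^k e in body.
# Frames: ("add_l", e1, k) for [.] + e1, ("addn", n, k), ("upd", x) for #x.

def fv(t):
    """Free variables of a term."""
    tag = t[0]
    if tag == "var":
        return {t[1]}
    if tag == "add":
        return fv(t[1]) | fv(t[2])
    if tag == "addn":
        return fv(t[2])
    if tag == "let":  # let is recursive: x scopes over e and the body
        return (fv(t[3]) | fv(t[4])) - {t[1]}
    return set()      # annotated integer


def closed(heap, term, stack):
    """True if every variable reachable from term and stack is bound."""
    bound = set(heap) | {f[1] for f in stack if f[0] == "upd"}
    todo = set(fv(term))
    for f in stack:
        if f[0] == "add_l":
            todo |= fv(f[1])
    seen = set()
    while todo:
        x = todo.pop()
        if x in seen:
            continue
        seen.add(x)
        if x not in bound:
            return False
        if x in heap:
            todo |= fv(heap[x][1])
    return True


def run(term):
    heap, stack = {}, []  # the top of the stack is the end of the list
    while True:
        tag = term[0]
        if tag == "let":                     # Let (binder assumed fresh)
            _, x, k, e, body = term
            heap[x] = (k, e)
            term = body
        elif tag == "var":
            x = term[1]
            if x not in heap:                # black hole or dangling pointer
                return "stuck"
            k, e = heap.pop(x)
            if k == "w":                     # Var-w: push an update marker
                stack.append(("upd", x))
            elif not closed(heap, e, stack): # Var-1 must stay closed
                return "stuck"
            term = e
        elif tag == "add":                   # Unwind [.] + e1
            stack.append(("add_l", term[2], term[3]))
            term = term[1]
        elif tag == "addn":                  # Unwind add_n [.]
            stack.append(("addn", term[1], term[3]))
            term = term[2]
        else:                                # the term is a value n^k
            _, n, k = term
            if not stack:
                return n
            frame = stack.pop()
            if frame[0] == "add_l":          # Reduce: n + e to add_n e
                term = ("addn", n, frame[1], frame[2])
            elif frame[0] == "addn":         # Reduce: add_n0 n1 to n0 + n1
                term = ("int", frame[1] + n, frame[2])
            elif k == "w":                   # marker, value annotated w
                heap[frame[1]] = ("w", term)
            elif not closed(heap, term, stack):  # marker, value annotated 1
                return "stuck"


def example(binding_ann, plus_ann="w"):
    """let x =^binding_ann 1 + 2 in x + x, with the inner + annotated."""
    rhs = ("add", ("int", 1, "w"), ("int", 2, "w"), plus_ann)
    return ("let", "x", binding_ann, rhs,
            ("add", ("var", "x"), ("var", "x"), "w"))


print(run(example("w")), run(example("1")))
```

Running the two examples shows the shared binding evaluating normally while the same term with a 1-annotated binding goes wrong, because a reference to `x` is still on the stack when the binding is removed.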

3 Type System

The semantics in Section 2 specifies that for a binding $x = e$ to be safely annotated
with a 1 it is required that whenever the binding is used through the rule

\[
\langle H, x =^{1} e ;\ x ;\ S \rangle \longrightarrow \langle H ;\ e ;\ S \rangle,
\]

the configuration must remain closed. Thus there may only be one (non-binding)
occurrence of $x$ in the configuration, namely the one that is dereferenced.
Similarly, to safely annotate a value with 1 it is required that if and when the value
is used and there is an update marker $\#x$ on the stack,
then there is no live occurrence of $x$ in the configuration so that the configuration
remains closed. Our type system (and most other type based usage analyses)
is based on the following simple idea. If, when a binding $x = e$ is created, $x$
occurs only once in the configuration and $x$ never gets duplicated during the
computation then $x$ will occur only once if and when it is dereferenced.¹

¹ We will strengthen this idea in an obvious but important way: when a variable
occurs once in each of several branches of a case-expression then, since eventually only one
branch will be taken, we may consider it as occurring only once.

3.1 Type Language

In order to construct a type system for the annotated language we need a cor- 
responding annotated type language. We start by extending the annotation lan- 
guage from the previous section to include annotation variables. 

Annotations k ::= 1 | w | fc | j 

We will use two kinds of variables, type annotation variables, ranged over by k, 
and program annotation variables, ranged over by j. Type annotation variables 
may occur in the annotations on a type but not in the annotations on a program. 
Conversely, program annotation variables may occur in programs but not in 
types. 

The structure of the type language closely follows the structure of the term 
language and we will have one kind of type for every syntactic category. We let 
p range over value types which is the form of type we will assign to values. 

Type Variables a 

Value Types p ::= a | Int | cr — r | List kq ki K2 p 

Our value types contains type variables, an integer type, function types and the 
list type. The function types relies on a notion of binding types, ranged over by a, 
and expression types, ranged over by r, which we will introduce below. Expression 
types are used to give types to expressions and are defined as follows. 

Expression Types r ::= p^ 

An annotated value will be given a type of the form p” and a non- value e will 
be given a type such that the annotated value of e (if e terminates) will have 
that type. Thus, for example, saying that a term has a type p“ means that the 
value of the term may be used any number of times. Binding types which we 
will use to give a type to bindings are defined as follows. 

Binding Types u ::= 

A binding x ='^ e may be given a type of the form where r is the type of 
e. We also use binding types to give a type to a variable when we can think of 
the variable as a reference, for example when we pass it as an argument to a 
function. A type of a variable is then simply the type of the bindings it may refer 
to. Recall that we used expression types and binding types in the type <t — >■ r 
of a function. A function of this type can be applied to a variable (remember 
functions can only be applied to variables due to the syntactic restriction in our 
language) with the binding type cr and then it will return something of type t. 
We can also use binding types to logically justify our type List kq ki K2 ks p of 
lists. We can obtain this type simply by annotating the right hand side of the 
data type definition 

List a = nil | cons a (List a) 

such that the arguments to the constructors are binding types, as follows. 
$\mathrm{List}\ k_0\ k_1\ k_2\ k_3\ \alpha = \mathrm{nil} \mid \mathrm{cons}\ (\alpha^{k_0})^{k_1}\ ((\mathrm{List}\ k_0\ k_1\ k_2\ k_3\ \alpha)^{k_2})^{k_3}$ 

The arguments to the constructors should be binding types 
simply because constructors, due to the syntactic restriction, may be applied 
only to variables. 

3.2 Subtyping 

A key observation which we will use to justify our subtyping relation is that 1 
operationally approximates $\omega$, i.e., if in any term e we replace any occurrence 
of 1 with $\omega$ then the modified term will run successfully without going wrong 
if and when e does. We define the subtyping relation on closed types, where the 
ordering on annotations is the operational approximation $1 \le \omega$, by the following 
rules. 

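$$\frac{\sigma' \le \sigma \qquad \tau \le \tau'}{\sigma \to \tau \le \sigma' \to \tau'} \qquad \frac{(\rho^{\kappa_0})^{\kappa_1} \le ({\rho'}^{\kappa_0'})^{\kappa_1'} \qquad \kappa_2' \le \kappa_2 \qquad \kappa_3' \le \kappa_3}{\mathrm{List}\ \kappa_0\ \kappa_1\ \kappa_2\ \kappa_3\ \rho \le \mathrm{List}\ \kappa_0'\ \kappa_1'\ \kappa_2'\ \kappa_3'\ \rho'}$$

$$\frac{}{\mathrm{Int} \le \mathrm{Int}} \qquad \frac{\rho \le \rho' \qquad \kappa' \le \kappa}{\rho^{\kappa} \le {\rho'}^{\kappa'}} \qquad \frac{\tau \le \tau' \qquad \kappa' \le \kappa}{\tau^{\kappa} \le {\tau'}^{\kappa'}}$$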

Note that the subtype ordering is contravariant with respect to the ordering on 
the annotations. The rule for lists can be understood by unfolding the annotated 
data type definition for lists. 
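
The subtyping rules for closed types can be read as a recursive check. The following Python sketch is only an illustration and not part of the original system: the tuple encoding of types, the names `ann_le` and `subtype`, and the string "w" standing for the annotation omega are assumptions made here, and the rule for lists is left out for brevity.

```python
ONE, OMEGA = "1", "w"  # "w" stands for the annotation omega

def ann_le(k1, k2):
    """Ordering on annotations: 1 <= omega, plus reflexivity."""
    return k1 == k2 or (k1 == ONE and k2 == OMEGA)

# Hypothetical encoding of closed types as tuples:
#   ("int",)             the value type Int
#   ("fun", sigma, tau)  the value type sigma -> tau
#   ("ann", t, k)        an annotated type t^k (expression or binding type)

def subtype(t, u):
    """Return True if t <= u under the encoding above (lists omitted)."""
    if t[0] != u[0]:
        return False
    if t[0] == "int":
        return True
    if t[0] == "fun":
        # contravariant in the argument, covariant in the result
        return subtype(u[1], t[1]) and subtype(t[2], u[2])
    if t[0] == "ann":
        # covariant in the body, contravariant in the annotation
        return subtype(t[1], u[1]) and ann_le(u[2], t[2])
    return False

INT_1 = ("ann", ("int",), ONE)    # Int^1
INT_W = ("ann", ("int",), OMEGA)  # Int^omega
```

Under this encoding `subtype(INT_W, INT_1)` holds but `subtype(INT_1, INT_W)` does not, which reflects the contravariance in the annotations: a value that may be used any number of times may in particular be used at most once.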

3.3 Constraints 

In order to extend the subtyping relation to types with type variables and 
annotation variables we need the notion of constraints. To be able to represent 
constraints compactly we introduce a new form of constraints which may 
contain calls to constraint abstractions. A constraint abstraction is simply a function 
that given some annotation variables returns a constraint. We will let $\phi$ range 
over constraint abstractions, $l$ range over constraint abstraction variables and $\Pi$ 
range over constraints. 

Annotation constraints   $\Pi ::= \kappa_0 \le \kappa_1 \mid \Pi_0, \Pi_1 \mid \mathbf{let}\ \phi\ \mathbf{in}\ \Pi \mid \exists \bar{k}.\Pi \mid l\ \bar{\kappa}$ 
Constraint abstractions  $\phi ::= l\ \bar{k} = \Pi$ 

Constraint abstractions allow different substitution instances of a constraint to 
share the same representation. For example, to represent instances of the 
constraints $k_0 \le k_1,\ k_1 \le k_2$ we can define an abstraction 

$l\ k_0\ k_1\ k_2 = k_0 \le k_1,\ k_1 \le k_2$ 

and represent $(\kappa_0 \le \kappa_1,\ \kappa_1 \le \kappa_2),\ (\kappa_3 \le \kappa_4,\ \kappa_4 \le \kappa_5)$ as 

$\mathbf{let}\ l\ k_0\ k_1\ k_2 = k_0 \le k_1,\ k_1 \le k_2\ \mathbf{in}\ l\ \kappa_0\ \kappa_1\ \kappa_2,\ l\ \kappa_3\ \kappa_4\ \kappa_5$. 
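
To make the sharing concrete, here is a small Python sketch of a constraint abstraction and its instantiation. It is only an illustration under assumptions made here: an abstraction is modelled as a pair of formal parameters and a list of inequalities, and the name `instantiate` is hypothetical.

```python
def instantiate(abstraction, args):
    """Expand one call to a constraint abstraction into plain inequalities."""
    params, body = abstraction
    subst = dict(zip(params, args))
    # Each pair (a, b) stands for the inequality a <= b.
    return [(subst.get(a, a), subst.get(b, b)) for a, b in body]

# The abstraction  l k0 k1 k2 = k0 <= k1, k1 <= k2
chain = (("k0", "k1", "k2"), [("k0", "k1"), ("k1", "k2")])

# The two calls from the example: the body is stored once and called twice.
calls = [("K0", "K1", "K2"), ("K3", "K4", "K5")]
expanded = [c for args in calls for c in instantiate(chain, args)]
```

Expanding the two calls yields the four inequalities of the example, while the representation itself stores the body of the abstraction only once.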
Thus with constraint abstractions the size of any instance is linear in the number 
of free type annotation variables of the constraint, whereas the size of the original 
constraint may be quadratic in the sum of the number of free type annotation 
variables and free program annotation variables (or even worse if it contains 
existential quantifiers). With constraint abstractions we can avoid the exponential 
explosion of constraints which can happen with a naive approach. To see why, 
consider a program of the following form. 

let $f_0$ = ... 
in let $f_1$ = ... $f_0$ ... $f_0$ ... 
in let ... 
in let $f_n$ = ... $f_{n-1}$ ... $f_{n-1}$ ... 
in ... 

The first naive algorithm for the similar problem of flow analysis with bounded 
flow polymorphism, presented by Mossin [Mos97], suffers from the exponential 
explosion problem and would proceed as follows. It first infers the polymorphic 
type for $f_0$. Then, to compute the type for $f_1$, it instantiates the type of 
$f_0$ twice and thus makes two instances of the constraints contained in the type 
schema, so the constraints for $f_1$ will be at least twice as big. This is repeated n 
times and thus the size of the resulting constraints will be exponential in the call 
depth n. In practice the call depth typically does not grow linearly with the size 
of the program, but it does tend to increase with program size, which 
makes this a problem that occurs in practice. With constraint abstractions 
we can avoid the problem and represent the constraints as follows. 

let $l_0\ \bar{k}_0$ = ... 
in let $l_1\ \bar{k}_1$ = ... $l_0\ \bar{k}'_0$ ... $l_0\ \bar{k}''_0$ ... 
in let ... 
in let $l_n\ \bar{k}_n$ = ... $l_{n-1}\ \bar{k}'_{n-1}$ ... $l_{n-1}\ \bar{k}''_{n-1}$ ... 
in ... $l_n\ \bar{k}'_n$ ... $l_n\ \bar{k}''_n$ ... 
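
The difference in growth can be illustrated with a back-of-the-envelope count. The Python sketch below is an assumption-laden illustration, not the authors' measure: it assumes every definition contributes `own` constraints of its own and uses its predecessor exactly twice, and the function names are hypothetical.

```python
def naive_size(n, own=1):
    """Constraints in the type of f_n when each use copies the callee's constraints."""
    size = own                # f_0 only has its own constraints
    for _ in range(n):
        size = own + 2 * size  # own constraints plus two full copies
    return size

def abstraction_size(n, own=1):
    """Total size when each f_i stores its own constraints plus two calls."""
    return own + n * (own + 2)
```

With `own = 1` and call depth 10 the naive count is 2047 while the representation with abstractions has size 31, so the first grows exponentially and the second linearly in the call depth.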

To give semantics to constraints we will use closing substitutions from type 
variables to value types and annotation variables to annotations, ranged over by 
$\vartheta$. The meaning of a constraint $\Pi$ is given by a relation $\vartheta; \Phi \models \Pi$ (read as: $\vartheta; \Phi$ 
models $\Pi$) defined coinductively by the following rules. 

Ko'd < til'd d;(f \= TTq \= IIi -d; \= II 

•d; (f \= kq < Ki 'd;$ \= Ho, 7Ti 'd;$ \= let $' in II 

d-,$Y^ n[k ■= k] 'd;$\= n[k := k] 

3k. n 'd]p\= I K 

We will sometimes write $\vartheta \models \Pi$ as a shorthand for $\vartheta; \epsilon \models \Pi$. We will let $\Psi$ range 
over constraints concerning type variables. 

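Type variable constraints  $\Psi ::= \alpha_0 \le \alpha_1 \mid \Psi_0, \Psi_1 \mid \exists \alpha.\Psi$ 

The meaning of a constraint $\Psi$ is given by a relation $\vartheta \models \Psi$ (read as: $\vartheta$ models 
$\Psi$). We define $\vartheta \models \Psi$ inductively by the following rules. 

$$\frac{\vartheta(\alpha_0) \le \vartheta(\alpha_1)}{\vartheta \models \alpha_0 \le \alpha_1} \qquad \frac{\vartheta \models \Psi_0 \qquad \vartheta \models \Psi_1}{\vartheta \models \Psi_0, \Psi_1} \qquad \frac{\vartheta[\alpha := \rho] \models \Psi}{\vartheta \models \exists \alpha.\Psi}$$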

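We will let $\Theta$ range over pairs $\Pi; \Psi$ and we define $\vartheta \models \Pi; \Psi$ iff $\vartheta \models \Pi$ 
and $\vartheta \models \Psi$. The whole purpose of having constraints is that they allow us to 
extend the subtyping relation to types with variables. We will define a relation 
$\Theta \models \rho_0 \le \rho_1$ where $\rho_0$ and $\rho_1$ may be open types, which reads: $\rho_0 \le \rho_1$ is a 
consequence of $\Theta$. It is defined as $\Theta \models \rho_0 \le \rho_1$ iff for every $\vartheta$, if $\vartheta \models \Theta$ then 
$\rho_0\vartheta \le \rho_1\vartheta$. We also define $\Theta \models \tau_0 \le \tau_1$ and $\Theta \models \sigma_0 \le \sigma_1$ in the same manner. 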

3.4 Type Schemas 

Our type system incorporates bounded polymorphism so we need type schemas 
where the quantified variables are bounded by some constraints. 

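Type Schemas  $\chi ::= \forall \bar{k}, \bar{\alpha}.\ \rho \,|\, \Theta$ 

We will define a relation $\Theta \models \chi \prec \rho$ which reads as: it is a consequence of $\Theta$ that 
$\chi$ can be instantiated to $\rho$. It is defined as $\Theta \models (\forall \bar{k}, \bar{\alpha}.\ \rho \,|\, \Theta') \prec \rho[\bar{k} := \bar{\kappa}, \bar{\alpha} := \bar{\rho}]$ 
iff for every $\vartheta$, if $\vartheta \models \Theta$ then $\vartheta \circ [\bar{k} := \bar{\kappa}, \bar{\alpha} := \bar{\rho}] \models \Theta'$. We will sometimes 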
consider a value type p to be a type schema with no quantified variables and no 
constraints. 

3.5 Contexts 

We use $\Gamma$ and $\Delta$ to range over typing contexts which are multisets of type 
associations of the form x : x% (and since we may consider a value type p as a 
type schema there may also be type associations of the form x : p” ). As usual 
we will use contexts when we give a type to a term with free variables. Thus 
we will say that e has the type $\tau$ in a context $\Gamma$ if we can give e the type $\tau$ 
assuming that the free variables in e have the types given by $\Gamma$. However the 
context also plays another important role: it records the number of times each 
variable occurs in the term. Thus if x occurs n times in e it also occurs n times 
in $\Gamma$ (with one important exception, namely if x occurs in different branches of 
a case-expression). This may be a bit surprising at first. Consider for example 
the term {\y.y x with the free variable x. We will be able to say that 

this term has the type Int^ in the context x : Int“. According to the reduction 
relation the term can reduce to x x so we would expect to be able to give 
x-|-^ X the same type in the same context. However this will not be possible since 
X now occurs twice in the term. Instead we can type the term in the context 
X : Int“,x : Int“ where x occurs twice. To be able to state a relation between 
the contexts before and after a reduction we define a rewrite relation on contexts. 

r, X : xZ ^ r, X : xZ, X : xZ r, x : xl ^ T 
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. , 6>; ro,A l~ e : r x ^ dom(ro) 

^6>; Jb h Xx.e ■. a ^ t & \= x : a — >•* A 



Int 



6>; 0 h n : Int 



Nil 



0; 0 h nil : List Kq Ki K2 Ks p 



Cons 



: XOKo,y ■ XiS 1“ cons a: 2 / : p' 




Fig. 2. Typing rules for values 



We have two rewrite rules. The first says that a type association of the form 
X : xZ be duplicated. This is supposed to model the duplication of a variable 
X during the computation. Note that we may not duplicate a type association of 
the form x : Xi- This reflects our intention that a variable that refers to a binding 
which will not be updated, must not be duplicated. The second rule simply allows 
us to remove a type association. This corresponds to the case when a variable is 
dropped during the computation (for example since it occurred in a branch of a 
case-expression that was not selected) . These rewrite rules will play a role similar 
to the contraction and weakening rules in logic. The restricted duplication (i.e., 
that we may only duplicate type associations of the form x : xZ) corresponds to 
the restricted form of contraction in linear logic |Gir ill. We extend the relation 
to contexts with open types in the same way as with the subtyping relation by 
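defining $\Theta \models \Gamma_0 \to^* \Gamma_1$ iff for every $\vartheta$, if $\vartheta \models \Theta$ then $\Gamma_0\vartheta \to^* \Gamma_1\vartheta$. Finally we 
will also need the relation $\Theta \models \mathbf{if}\ \kappa = \omega\ \mathbf{then}\ \Gamma \to^* \Gamma, \Gamma$ which holds iff for 
every $\vartheta$, if $\vartheta \models \Theta$ and $\kappa\vartheta = \omega$ then $\Gamma\vartheta \to^* \Gamma\vartheta, \Gamma\vartheta$. 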
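
For closed contexts, the reflexive transitive closure of the two rewrite rules can be decided by counting. The Python sketch below is a simplified illustration under assumptions made here: a type association is modelled as a triple (name, type, annotation) with a single annotation that decides whether it may be duplicated, the string "w" stands for omega, and the name `rewrites_to` is hypothetical.

```python
from collections import Counter

def rewrites_to(gamma, delta):
    """Check whether the multiset gamma can be rewritten to the multiset delta.

    Associations may always be dropped; an association may be duplicated
    only if it is present and carries the annotation "w" (omega).
    """
    have, want = Counter(gamma), Counter(delta)
    for assoc, n in want.items():
        _name, _type, ann = assoc
        if have[assoc] == 0:
            return False      # nothing to start from: rewriting never invents associations
        if n > have[assoc] and ann != "w":
            return False      # more copies needed, but duplication is not allowed
    return True
```

Dropping plays the role of weakening and the restricted duplication the role of contraction, so a context with one omega-annotated association for x rewrites to a context with two, while a 1-annotated association can only be kept or dropped.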

3.6 Typing Judgements 

Typing judgements for values take the form $\Theta; \Gamma \vdash v : \rho$ and shall be read: under 
the constraints $\Theta$ and in the context $\Gamma$, the value v can be given the value type 
$\rho$. Similarly we will have typing judgements for expressions, alternatives and 
bindings. As discussed in the previous section, the context $\Gamma$ in our judgements 
keeps track of the types of the free variables in the term as usual, but it also 
records the number of times each variable occurs in the term. 

3.7 Typing Rules 

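The typing rules for values are in Figure 2. The key feature of the rule Abs 

$$\frac{\Theta; \Gamma_0, \Gamma_1 \vdash e : \tau}{\Theta; \Gamma_0 \vdash \lambda x.e : \sigma \to \tau} \qquad x \notin \mathrm{dom}(\Gamma_0), \quad \Theta \models x : \sigma \to^* \Gamma_1$$

is that if x occurs more than once in e then the abstraction will be assigned a 
type of the form $(\rho^{\kappa})^{\kappa'} \to \tau$ where $\kappa$ and $\kappa'$ are constrained to be $\omega$, indicating that a 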
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Value 



0; r V p 0 \= If Hi = u) then F — >■* F, F 



Var 



0\F \- v'^ : 0 \= k' < K 

0 N X A p 



Alts 



, 0-,F^e-.{pZ^rT 

0\ X : XkI y- X -.t 0\= p’^^ <t 0\F,x \ Xki ex : r " ^ ^ 

6>; To I- eo : Int”“ 6>; A h d : , 

Plus — -= 0 \= K < K 

0; Fo,Fi h eo +" ei : Int'= 

0; Tq, A h ep : r 0; Up, U 2 , Ta h ei : r 

0; ro,Fi,F 2 h {nil ep; cons a: y ^ ei} : p' ^ r "2 2 " p 

ey [= X . Pkq , y ■ P k2 3 



p^ = List Kq Kl K2 K3 p 
x,y ^ dom(A, A) 



Case 



0; A h e : p” 0; A h alts : p ^ t 
0; A, A h case e of alts : r 



Fig. 3. Typing rules for expressions 



variable will be duplicated if it is passed to the abstraction. This is accomplished 
by first typing e in a context $\Gamma_0, \Gamma_1$ where $x \notin \mathrm{dom}(\Gamma_0)$. Then, if x occurs more 
than once in e, x will occur more than once in $\Gamma_1$. Now the second side condition 
specifies that we must be able to rewrite $x : (\rho^{\kappa})^{\kappa'}$ to $\Gamma_1$, which clearly involves 
duplicating $x : (\rho^{\kappa})^{\kappa'}$ (since x occurs more than once in $\Gamma_1$) and therefore constrains $\kappa$ 
and $\kappa'$ to be $\omega$. The typing rule for integers is straightforward and the rules for 
lists can be understood by unfolding the annotated data type definition for lists. 

We have divided the typing rules for expressions into two figures. Most rules 
appear in Figure 3 but the rules which concern let expressions are in Figure 4. 
The rule Value 

$$\frac{\Theta; \Gamma \vdash v : \rho}{\Theta; \Gamma \vdash v^{\kappa} : \rho^{\kappa'}} \qquad \Theta \models \mathbf{if}\ \kappa' = \omega\ \mathbf{then}\ \Gamma \to^* \Gamma, \Gamma, \quad \Theta \models \kappa' \le \kappa$$

is used to type an annotated value. Saying that an annotated value has the type 
$\rho^{\kappa'}$ means that if $\kappa'$ is $\omega$ the value may be used any number of times and thus 
it will take care of any update marker on the stack. Taking care of an update 
marker means updating with the value, thus duplicating any free variables of 
the value. The purpose of the side condition $\Theta \models \mathbf{if}\ \kappa' = \omega\ \mathbf{then}\ \Gamma \to^* \Gamma, \Gamma$ is 
to ensure that these variables may safely be duplicated if $\kappa'$ is constrained to be 
$\omega$. 

In order to type case-expressions we introduce an auxiliary form of judgements 
for alternatives. We give alternatives a type of the form $\rho \Rightarrow \tau$ where $\rho$ is 
the type of the value that is being scrutinised and $\tau$ is the type of the branches. 
The rule Alts 
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Binding 



J7o; 'I/', r \- e \ p 



KO 



U\ r \- X ='^ e ■. {x \ (Vfci, tti. p I lk 2 \ where l %2 = 3ko.IIo 



{*) 



Binding group-e— 



e h e : £ where e 



ko ^ ftav{r,po), ao ^ fty{r,po), 
i*)ki ^ ftav{r,Ko, lk 2 = 3ko-IIo), 

Si ^ ftv{r), n \= Ki < K 



77; Jo h 6 : (a: : xltn) where (f> II-, Fi \- b : A where ^ 

Binding group =; =; 

77; To, Fi \- b, b ■. {x : where 0, (f> 



Let 



77o; To, Fi \- b ■. A where 4> ^7i; !7; Ta, To h e : t 

77; iT; T’l, To h let b in e : r 



dom(A, To) n dom(Zi) 
77 1= Z\ To; T 2 
77 1= 77o, let $ in 77i 



0 



Fig. 4. Typing rules for bindings and let expressions 



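$$\frac{\Theta; \Gamma_0, \Gamma_1 \vdash e_0 : \tau \qquad \Theta; \Gamma_0, \Gamma_2, \Gamma_3 \vdash e_1 : \tau}{\Theta; \Gamma_0, \Gamma_1, \Gamma_2 \vdash \{\mathrm{nil} \Rightarrow e_0;\ \mathrm{cons}\ x\ y \Rightarrow e_1\} : \rho' \Rightarrow \tau}$$

$$\rho' = \mathrm{List}\ \kappa_0\ \kappa_1\ \kappa_2\ \kappa_3\ \rho \qquad x, y \notin \mathrm{dom}(\Gamma_0, \Gamma_2) \qquad \Theta \models x : (\rho^{\kappa_0})^{\kappa_1},\ y : ({\rho'}^{\kappa_2})^{\kappa_3} \to^* \Gamma_3$$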



for alternatives contains a subtle treatment of contexts. If a variable occurs once 
in each branch of the case-expression and thus twice in the term it may still occur 
only once in the context. This is achieved by collecting the variables that occur 
in both branches in a common context $\Gamma_0$, thus effectively counting a variable 
occurring in both branches as one. Finally, the side conditions take care of the 
variables bound in the cons-pattern. They ensure that if x (and/or y) occurs 
several times in $e_1$ then $\kappa_0$ and $\kappa_1$ (and/or $\kappa_2$ and $\kappa_3$) will be constrained to 
be $\omega$. Thanks to the auxiliary rule for alternatives the rule for case-expressions 
becomes entirely straightforward. 

To type let-expressions we first introduce an auxiliary form of typing judge- 
ments for bindings. We will give bindings a type of the form x : i.e., the type 

of a binding includes the name of the bound variable (so it can be considered as 
a type association). The rules for typing bindings appear in Figure 4. To type 
a binding with the rule Binding 



77q; IT; T h e : 

n-r\-x='^e: {x : (Vfci, ai. p \ lk 2 ', ^So-^)ki) where llt 2 = 3ko-F[o 

ko ^ ftav{r,p‘^), 3o ^ fty{r,p^), 

(*) kl ^ ftav{r, Ko, lk 2 = 3ko.IIo), 
di ^ ftv{r), n \= Kl < K, 



we first type the expression in the binding and yield the constraints $\Pi_0; \Psi$. We 
may then existentially quantify variables which appear in the constraints to 
obtain $\exists \bar{k}_0.\Pi_0$ and $\exists \bar{\alpha}_0.\Psi$ provided $\bar{k}_0$ and $\bar{\alpha}_0$ do not occur free elsewhere in the 
judgement. This is ensured by the first line of side conditions. We then form 
the type schema $\forall \bar{k}_1, \bar{\alpha}_1.\ \rho \,|\, l\ \bar{k}_2; \exists \bar{\alpha}_0.\Psi$ by universally quantifying $\bar{k}_1$ and $\bar{\alpha}_1$. The 
second line of side conditions simply ensures that $\bar{k}_1$ and $\bar{\alpha}_1$ do not occur free 
elsewhere in the judgement. We put $\exists \bar{\alpha}_0.\Psi$ in the type schema but not $\exists \bar{k}_0.\Pi_0$. 
Instead we introduce a constraint abstraction $l\ \bar{k}_2 = \exists \bar{k}_0.\Pi_0$ and put a call to the 
constraint abstraction into the type schema. We also need a form of judgements 
for groups of bindings. As you would expect, the type of a group of bindings is 
just a set of type associations (i.e., a typing context) and the typing rules just 
collect the type associations and the corresponding constraint abstractions. In 
the rule Let 



TJq; To, Fi \- b : A where $ TTi; T; T2, T3 h e : t 
7T; T; Ti, T 3 h let 6 in e : r 



dom(Ti, T 3 ) n dom(Z\) = 0 
n^A^* To; T2 
n ^ ilo , let $ in III 



we first type the bindings which gives a context $\Delta$ which contains the type 
schemas associated with each binding. The first two side conditions ensure that 
the annotated type schema associated with each variable $x_i$ in $\Delta$ is consistent with the 
type of each use of $x_i$. They also ensure that if $x_i$ may be used more than once 
then $\kappa_i$ and $\kappa'_i$ must be constrained to $\omega$. This is achieved as follows. If $x_i$ occurs 
more than once in e and the right hand sides of b then $x_i$ will also occur more 
than once in $\Gamma_0, \Gamma_2$. Thus the second side condition will ensure that $\kappa_i$ and $\kappa'_i$ are 
constrained to be $\omega$. The typing of the bindings also gives a group of constraint 
abstractions $\phi$. With the constraint abstractions we form the constraint $\mathbf{let}\ \phi\ \mathbf{in}\ \Pi_1$ 
which by the third side condition must be a consequence of the constraints in 
the conclusion of the rule. 



3.8 Soundness 

The soundness of our type system simply says that a well typed program is well 
annotated, i.e., when we run it in the abstract machine it does not go wrong. 

Theorem 1. If $\Theta; \emptyset \vdash e : \tau$ and $\vartheta \models \Theta$ then $e\vartheta$ cannot go wrong. 

The result is established by extending the type system to abstract machine 
configurations and then proving a subject reduction result which says that typings 
are preserved by transitions in the abstract machine. A very similar proof for 
the type system in [Gus98] is presented in full detail in [Gus99]. 

3.9 Inference Algorithm 

As stated, the type system is undecidable since it employs type polymorphic 
recursion. Our inference algorithm will therefore take a term which is explicitly 
typed in the underlying ordinary type system and can handle type polymorphic 
recursion if presented to it through the type annotations. It will first compute a 
usage typing judgement which is principal with respect to the given typing 
judgement, i.e., every other usage typing judgement is an instance of the computed 
judgement if "stripping the annotations" from it yields the judgement in the 
underlying type system. The second phase of the algorithm then computes the 
best solution to the constraints in the principal judgement using the techniques 
described in a companion paper [GS01]. 

The time complexity of the algorithm is dominated by the cost of the constraint 
solving in the second phase. We can argue, as follows, that the time 
complexity of the second phase is $O(n^3)$ where n is the size of the explicitly 
typed term. Let the skeleton of the constraints be the constraints where all 
occurrences of inequality constraints of the form $\kappa_0 \le \kappa_1$ have been removed. What 
remains are the binding occurrences of variables and all calls to constraint 
abstractions. By inspecting the typing rules we can see that the size of the skeleton 
of the constraints required to type a program is proportional to the size of the 
explicitly typed program. Moreover the number of free annotation variables in 
the constraints is proportional to the size of the program. From these facts and 
Theorem 2 of [GS01] we can conclude that the complexity is $O(n^3)$ where n is 
the size of the typed program. 

For a version of the analysis in this paper without usage-polymorphic recur- 
sion we have developed an algorithm based on non-recursive constraint abstrac- 
tions with a worst case complexity of 0{n * m * t^) where n is the size of the 
untyped lambda lifted version of the program, m is the size of the type of the 
largest set of (properly) mutually recursive definitions and t is the size of the 
largest instantiated type [Sve00]. Since m and t typically grow slowly or not at 
all with program size we expect that algorithm to scale up well in practice. 

4 Related Work 

There is a rich literature on analyses which aim at avoiding updates. See [Gus99] 
for a thorough overview. This work especially borrows ideas from the type based 
approach by Turner, Wadler and Mossin [TWM95], and its followups by 
Gustavsson [Gus98] and Wansbrough and Peyton Jones [WPJ99]. Bounded 
polymorphism was proposed by Turner, Wadler and Mossin [TWM95] and the idea 
to use subtyping in usage analysis originates from the work by Faxén [Fax95] 
(the subtyping in his flow analysis and the directed edges in the post processing 
achieve the same effect as the subtyping in this paper) although it was 
independently proposed by Gustavsson [Gus98] and Wansbrough and Peyton Jones 
[WPJ99]. 

The analysis which seems to be closest in expressive power to ours is an 
analysis by Faxén based on an undecidable type based flow analysis [Fax97]. Due to 
the undecidable nature of the analysis his inference algorithm is not complete 
with respect to the type system. The algorithm is parametrised by a notion of 
finite name supply and the larger the name supply the better the algorithm 
approximates the type system. The exact relationship between the different degrees of 
approximation computed by his algorithm and our type system is not clear to 
us. 






The aim of this work is to make usage analysis scale up for large programs 
and in that respect it is most closely related to recent work by Wansbrough and 
Peyton Jones [WPJ00]. They have also observed that usage polymorphism is 
crucial for the accuracy of the analysis of large programs but they side-step the 
difficulties associated with bounded polymorphism. Instead they have a simple 
usage polymorphism where the quantified variables may not be constrained. 
This is achieved by an algorithm which eliminates inequality constraints prior to 
quantification by unifying constrained variables. The drawback of their approach 
is that, as they refrain from using bounded polymorphism, they get an analysis 
which is rather inaccurate when it comes to data structures. Consider for example 
the following program fragment. 

. . . map square (fromto 1 100) . . . 

The spine of the list produced by fromto is consumed linearly by map but a type 
system with their simple usage polymorphism cannot discover it. The reason 
is that in a system with simple usage polymorphism the usage of the spine 
must be unified with the usage of the elements, and in this case the elements are 
used more than once. In our system with bounded polymorphism the usage of the 
spine and the elements need only constrain each other through an inequality 
constraint, so we can deduce that the spine is used linearly although the elements 
are not. We believe that this situation is common enough in practice to have a 
significant effect on the accuracy of the analysis. 

That the number of constraints explodes is a problem also for other type 
based program analyses with bounded polymorphism. In that respect our work 
is most closely related to the work by Faxén [Fax95], Mossin [Mos97] and Rehof 
and Fähndrich [RF01]. Faxén and Mossin present inference algorithms for type 
based flow analyses which simplify constraint sets to smaller but equivalent 
constraint sets. In their recent work on type based flow analysis Rehof and 
Fähndrich use instantiation constraints to represent constraints compactly and 
thus instantiation constraints play a role similar to our constraint abstractions. 

5 Conclusions and Future Work 

We have presented a powerful and accurate type system for usage analysis with 
bounded usage polymorphism and subtyping. A key contribution is a new 
expressive form of constraints which allows constraints to be represented compactly 
through calls to constraint abstractions. In a companion paper [GS01] we show 
how to efficiently compute a least solution to constraints with constraint 
abstractions and we use this technique to obtain an $O(n^3)$ inference algorithm for 
our usage analysis, where n is the size of the explicitly typed program. 
Abstract. Implicit surfaces are defined by a real valued function. They 
can easily be defined and manipulated and have therefore gained great 
popularity in computer graphics. This paper presents a purely functional 
implementation of a well known algorithm to polygonize implicit sur- 
faces, based on spatial partitioning by means of octrees. While conven- 
tional implementations are laden with practical issues, our implementa- 
tion in Clean is straightforward, implements the algorithm very concisely 
and makes essential use of lazy evaluation. 

Further we present two enhancements to this basic algorithm: Introduc- 
ing a memo function greatly improves time efficiency. The appearance of 
a visualized implicit surface can be greatly enhanced by providing nor- 
mal vector information. For calculating normal vectors we adopt a lazy 
implementation of automatic differentiation. 



1 Introduction 

An implicit surface is given by the set of zeros of the underlying function, the 
so called implicit function. Implicit surfaces have many properties that make 
them attractive to model geometric objects, which is an important task in areas 
like computer graphics or animation. Implicit surfaces can be defined in a very 
concise way, transformed and manipulated easily. 

We visualize implicit surfaces by approximating the actual surface with poly- 
gons. We refer to the process of finding this approximation as generating or 
polygonizing an implicit surface. 

In order to find an appropriate number of zeros, we fix the domain we are 
interested in and partition it regularly using an octree, which means the (re- 
cursive) partition of a cube into eight similar subcubes. Partitioning this way 
is continued recursively for all cubes that intersect with the surface, till a pre- 
scribed depth is reached. Then the zeros on the edges of a cube are calculated 
and connected to polygons. 

We implement this algorithm in a purely functional way in the pure and 
lazy functional language Clean m- We obtain code that is much shorter than a 
comparable public domain implementation [3] in C. Due to lazy evaluation the 
space consumption of the program is minimal, but execution time of the Clean 
code is higher than that of the C code. 

We claim that our program can easily be understood and be changed and 
demonstrate this by adding two improvements to the basic version of our Clean 
implementation: We improve the run time behavior dramatically by introducing 
a memo function. This technique avoids evaluating the implicit function more 
than once at the same point in space. Further we adopt a lazy implementation 
of automatic differentiation [202] to calculate normal vectors and add it to our 
code. Normal vector information can greatly enhance the appearance of objects 
for certain visualization tools. 

The remainder of the paper is organized as follows: In the following section 
we briefly compare the three basic ways to define surfaces in three dimensional 
space and discuss their properties. Then we introduce the octree algorithm, which 
is followed by our Clean implementation in section four. Section five discusses 
efficiency issues, followed by a comparison of our implementation and the public 
domain C implementation. The next section sketches the concept of automatic 
differentiation and how it can be used for our purposes. The final section contains 
concluding remarks. 

The source code of the Clean program is available for public use [I3- 

2 Surfaces in Three Dimensional Space 

There are basically three ways to calculate a two dimensional surface in three 
dimensional Euclidean space: explicit, parametric and implicit. We choose the 
unit sphere as our example. 

- Explicit: $z = \pm\sqrt{1 - x^2 - y^2}$; $x, y \in \mathbb{R}$ 

This mapping defines a sphere explicitly. It is not a function, as it returns 
two, one or no result depending on the input. For each chosen $x \in \mathbb{R}$ and 
$y \in \mathbb{R}$ we calculate the z coordinate(s). For $x^2 + y^2 > 1$ there is no result, 
for $x^2 + y^2 = 1$ we have one result z = 0, and for $x^2 + y^2 < 1$ we obtain two 
results. 

- Parametric: $x = \cos\phi \sin\theta$, $y = \sin\phi \sin\theta$, $z = \cos\theta$; $\phi, \theta \in \mathbb{R}$. 

A parametric definition of the sphere is given by three trigonometric 
functions. For each chosen $\phi, \theta \in \mathbb{R}$ we calculate x, y and z respectively. 

- Implicit: $F(x, y, z) = x^2 + y^2 + z^2 - 1$ 

Finally the unit sphere can be implicitly defined by only one function 
$F : \mathbb{R}^3 \to \mathbb{R}$, where the surface is the set of coordinates $x, y, z \in \mathbb{R}$ for which 
$F(x, y, z) = 0$ holds. 

The implicit definition is the most compact and uniform way to define the 
unit sphere. Here the explicit definition can easily be derived from the implicit 
one. In general this is a very difficult task. 
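
For the unit sphere the step from the implicit to the explicit definition is a one-line calculation. The following Python sketch shows both side by side; the function names `sphere` and `explicit_z` are chosen here for illustration and do not come from the paper's Clean code.

```python
import math

def sphere(x, y, z):
    """Implicit function of the unit sphere: zero on the surface, negative inside."""
    return x * x + y * y + z * z - 1.0

def explicit_z(x, y):
    """Solve sphere(x, y, z) = 0 for z; returns no, one or two values."""
    d = 1.0 - x * x - y * y
    if d < 0:
        return []          # (x, y) lies outside the unit disc
    if d == 0:
        return [0.0]       # (x, y) lies on the equator
    r = math.sqrt(d)
    return [r, -r]
```

The three branches of `explicit_z` correspond to the three cases of the explicit mapping, while the implicit function needs no case distinction at all.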

Besides the conciseness there are many more possibilities and advantages 
when employing implicit functions to the area of geometric modelling m- Here 
we will only look at a particularly appealing manipulation, the blending of two 
or more implicit surfaces. 






Thorsten H.-G. Zorner et al. 



For example a blend of two intersecting spheres can be described very easily. 
We use boldface to denote a vector $\mathbf{x} := (x, y, z)$. Let $F(\mathbf{x})$ define one sphere and 
$G(\mathbf{x})$ another. We further assume that both F and G are negative at interior 
points of the respective spheres, which is not a significant restriction. Then the 
blend of the two is readily described by 

$H(\mathbf{x}) := \min \{ F(\mathbf{x}), G(\mathbf{x}) \}$. 




Fig. 1. Two blended spheres cut open in order to show, that blending is neither inter- 
section, nor a simulation of soap films 

The resulting surface is then the set of points $\mathbf{x}$ for which $H(\mathbf{x}) = 0$. The same 
method with which we generate implicit surfaces for F and G can be applied to 
H without any additional effort such as calculating the intersection (explicit case) 
or adjusting parameters (parametric case). 

It might be a problem that a blending function using min is no longer smooth 
(continuously differentiable). If smoothness is desired there exist other ways 
of blending surfaces that result in a continuously differentiable function. The 
following function, which we use in our examples, yields a smooth function, but 
is of course more expensive: 

$H(\mathbf{x}) := \frac{1}{2} \left( F(\mathbf{x}) + G(\mathbf{x}) - \sqrt{F(\mathbf{x})^2 + G(\mathbf{x})^2} \right)$ 
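
Both blends can be written as higher-order functions on implicit functions. The Python sketch below is an illustration only: the helper names are hypothetical, and the smooth blend follows our reading of the formula above, i.e. half of F + G minus the square root of F² + G².

```python
import math

def sphere_at(cx, cy, cz, r):
    """Implicit function of a sphere with centre (cx, cy, cz) and radius r."""
    def f(x, y, z):
        return (x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 - r * r
    return f

def blend_min(f, g):
    """Blend with min: negative wherever f or g is negative."""
    return lambda x, y, z: min(f(x, y, z), g(x, y, z))

def blend_smooth(f, g):
    """Smooth blend: same sign pattern as min, but differentiable away from f = g = 0."""
    def h(x, y, z):
        a, b = f(x, y, z), g(x, y, z)
        return 0.5 * (a + b - math.sqrt(a * a + b * b))
    return h

F = sphere_at(0, 0, 0, 1)
G = sphere_at(1, 0, 0, 1)
```

Both blends are negative inside either sphere and positive outside both, and they agree on points such as (-1, 0, 0) where one function is zero and the other positive. The smooth variant costs one extra square root per evaluation.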

3 The Octree Approach 

In this section we will discuss an algorithm for polygonizing implicit surfaces 
based on spatial decomposition, as discussed by Bloomenthal [2]. 

The nodes building the polygons are zeros of the implicit function. We need 
to find an appropriate number of zeros: Not too few because they do not approx- 
imate the surface well and not too many, as they would flood any visualization 
tool with redundant or invisible information (think of polygons below pixel size) . 

The algorithm is based on the partitioning of space in cubes, which are first 
recursively refined in areas close to the surface and finally get polygons inscribed 
into. The data structure that models the partitioning is a tree structure where 
each parent node possesses eight child nodes. Therefore it is called octree. 

We describe the basic idea in more detail: Given an implicit function F, we 
know that $F(\mathbf{x}) = 0$ holds for all points $\mathbf{x} = (x, y, z)$ on the surface. For all 
other points $\mathbf{z}$, $F(\mathbf{z})$ is either positive or negative, depending on which side of 
the surface $\mathbf{z}$ lies. Let us assume two points $\mathbf{y}$ and $\mathbf{z}$ for which F has different 
signs, $F(\mathbf{y}) > 0$ and $F(\mathbf{z}) < 0$. Then on the straight line between them there is (at 
least) one point $\mathbf{x}$ for which $F(\mathbf{x})$ is zero. Finding this point on a straight line 
is a simple one dimensional problem and can be solved by means of bisection. 
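
A minimal Python sketch of this bisection step follows. It is not the paper's Clean code: the name `zero_on_segment` and the fixed number of steps are assumptions made here, and the implicit function is assumed to be continuous along the segment.

```python
def zero_on_segment(f, p, q, steps=40):
    """Approximate a zero of f on the segment from p to q by bisection.

    p and q are (x, y, z) tuples at which f has opposite signs.
    """
    fp = f(*p)
    if fp * f(*q) > 0:
        raise ValueError("endpoints must have opposite signs")
    for _ in range(steps):
        m = tuple((a + b) / 2 for a, b in zip(p, q))
        fm = f(*m)
        if fp * fm <= 0:
            q = m            # the sign change lies between p and m
        else:
            p, fp = m, fm    # the sign change lies between m and q
    return tuple((a + b) / 2 for a, b in zip(p, q))
```

Each step halves the segment, so after 40 steps the remaining interval is about 10^-12 times the original length. A tolerance-based stopping rule would work equally well.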

We have reduced the problem to finding suitable positive and negative points, 
close to the actual surface, and to build polygons. This is accomplished by par- 
titioning interesting parts of the three dimensional space in a regular fashion. 

We start off with a cubic domain, which contains the area that we are inter- 
ested in. Edges of this and all subsequent cubes are parallel to the coordinate 
axes. We cut the initial cube into eight subcubes of equal size and repeat to do 
so recursively for interesting cubes. 

Cubes are referred to as interesting when they are intersected by the sur- 
face. In order to test this we evaluate the function at all eight nodes of a cube. 
Interesting cubes are the ones where the sign of the function value at least at 
one corner node differs from the others. In other words a cube with all posi- 
tive/negative nodes lies completely on one side of the surface - and is therefore 
uninteresting for our purposes. 
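
The corner test and the subdivision into eight subcubes can be sketched in a few lines of Python. The representation of a cube as an origin corner plus a side length and all function names are assumptions made for this illustration; a function value of exactly zero is treated as non-positive here.

```python
from itertools import product

def corners(origin, side):
    """The eight corner nodes of an axis-parallel cube."""
    ox, oy, oz = origin
    return [(ox + dx * side, oy + dy * side, oz + dz * side)
            for dx, dy, dz in product((0, 1), repeat=3)]

def is_interesting(f, origin, side):
    """True if the sign of f differs between at least two corner nodes."""
    signs = {f(*c) > 0 for c in corners(origin, side)}
    return len(signs) == 2

def subcubes(origin, side):
    """Split a cube into eight subcubes of half the side length."""
    half = side / 2
    return [(c, half) for c in corners(origin, half)]

def unit_sphere(x, y, z):
    return x * x + y * y + z * z - 1
```

For the unit sphere, the cube with origin (0, 0, 0) and side 2 is interesting, whereas a cube with origin (-2, -2, -2) and side 4 is not, because all of its corners are positive even though it encloses the whole sphere.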

This criterion is obviously not suitable to determine accurately if surface and 
cube in question are disjoint. There can always be a small detail in the surface, 
for which the checks at the corners fail. Additional checks will detect intersection 
in many more cases, but not all. A remedy to this shortcoming is to increase the 
prescribed depth of the tree. 

By construction all interesting cubes at the maximal depth have the same
side length. Assuming the side length of the initial cube to be $l_0$, the side
of a cube at depth k has length $l_k = l_0 / 2^k$. When this depth is reached,
we calculate the zeros on the edges of the cube between nodes of opposite
signs, which will be the nodes of the polygons to be drawn.

For each interesting cube at the desired depth we inscribe one or more poly- 
gons. Unfortunately there are configurations of positive and negative nodes on 
a cube where this task is ambiguous, as Figure 2 shows. 




Fig. 2. Ambiguous node configuration on a cube. The node in the front left and the 
node in the back right have a different sign from the others 



However, it is unambiguous on a tetrahedron. All $2^4 = 16$ possible
configurations of positive and negative signs on the four nodes of a
tetrahedron boil down to just three basic cases, shown in Figure 3.

Fig. 3. The three basic node configurations on a tetrahedron

For each tetrahedron we obtain either nothing (all four nodes have the same
sign), one triangle (one node differs from the others) or one quadrilateral
(two nodes differ in sign from the other two).
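The reduction of the 16 sign configurations to three cases can be checked by enumeration. The following Python sketch is an illustration only; `polygon_kind` is a hypothetical helper, not a function of the paper.

```python
from itertools import product

def polygon_kind(signs):
    """Classify a tetrahedron by the signs (True = positive) at its four nodes."""
    positives = sum(signs)
    minority = min(positives, 4 - positives)   # nodes that differ from the rest
    return {0: "nothing", 1: "triangle", 2: "quadrilateral"}[minority]

counts = {}
for signs in product((False, True), repeat=4):
    kind = polygon_kind(signs)
    counts[kind] = counts.get(kind, 0) + 1
```

Two configurations yield nothing, eight yield a triangle and six yield a quadrilateral, which accounts for all 16.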

We cut each interesting cube into six tetrahedra. This allows for a proper 
visualization, as all neighboring polygons share entire edges rather than just 
nodes with their neighbors. 




Fig. 4. Cube cut into six tetrahedra 



4 The Clean Implementation 

We assume basic knowledge of functional languages and in particular of the
pure and lazy functional language Clean m- 

As we work in Euclidean space, we base all geometric information on
three-dimensional vectors. We prefer an algebraic data type to an array for
ease of access. The vector has only three elements of type Real, one for each
coordinate direction.

:: Vectors = Vectors !Real !Real !Real

Since the edges of each cube are parallel to a respective coordinate axis, two 
vectors in three dimensional space suffice to define it. One vector contains the 
minimal values of each direction, the other one the maximal values. A pair of 
such vectors forms a cube. 

:: Cube :== (Vectors, Vectors)

We define a record to contain the tree structure of our octree and the
geometric information of the current cube. Combinations of three first letters
as field names indicate which subtree a field refers to. For instance, wnf
refers to west, north, front.

Fig. 5. Left hand side shows a cube being defined by two nodes. All edges are parallel
to the coordinate axes. Right hand side gives the names of the six directions

:: Octree =
    { cube :: Cube
    , wnf :: Octree, enf :: Octree, wsf :: Octree, esf :: Octree
    , wnb :: Octree, enb :: Octree, wsb :: Octree, esb :: Octree
    }

The implementation consists basically of two functions: a generating function, 
that generates the octree and a consuming function, that consumes the octree 
and produces the polygons. It exploits the fundamental advantage of functional 
programming as pointed out by Hughes |S]: the possibility of glueing programs 
together. 

The generating function defines a complete and unbounded octree. 

genOctree :: Cube -> Octree
genOctree current_cube=:( min=:Vectors west south back
                        , max=:Vectors east north front)
    # (Vectors mx my mz) = (min + max) /. 2.0
    = { cube = current_cube
      , wnf = genOctree (Vectors west my mz, Vectors mx north front)
      , enf = genOctree (Vectors mx my mz, max)
      , wsf = genOctree (Vectors west south mz, Vectors mx my front)
      , esf = genOctree (Vectors mx south mz, Vectors east my front)
      , wnb = genOctree (Vectors west my back, Vectors mx north mz)
      , enb = genOctree (Vectors mx my back, Vectors east north mz)
      , wsb = genOctree (min, Vectors mx my mz)
      , esb = genOctree (Vectors mx south back, Vectors east my mz)
      }

Due to lazy evaluation this potentially infinite data structure is only
evaluated as far as needed. We fix three global constants depthMAX, depthBS
and depthMIN. depthMAX prescribes the maximal depth of the octree, depthBS the
number of bisection steps. A minimal depth of the octree is given by depthMIN
in order to prevent the premature termination of the program for surfaces
with many gaps.
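Python has no lazy evaluation, but the idea of an unbounded octree that is built only as far as a consumer looks at it can be imitated with lazily computed attributes. The sketch below is an illustration under that assumption; the class `Octree`, its counter `created` and the function `count_cubes` are hypothetical names, and the six-letter field names of the Clean record are replaced by a plain list of children.

```python
from functools import cached_property

class Octree:
    """An unbounded octree whose children are created only on demand."""
    created = 0   # number of nodes that were actually built

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        Octree.created += 1

    @cached_property
    def children(self):
        (w, s, b), (e, n, f) = self.lo, self.hi
        mx, my, mz = (w + e) / 2.0, (s + n) / 2.0, (b + f) / 2.0
        xs, ys, zs = [(w, mx), (mx, e)], [(s, my), (my, n)], [(b, mz), (mz, f)]
        return [Octree((x0, y0, z0), (x1, y1, z1))
                for (x0, x1) in xs for (y0, y1) in ys for (z0, z1) in zs]

def count_cubes(tree, depth):
    """Visit the tree down to the given depth; deeper levels are never built."""
    if depth == 0:
        return 1
    return sum(count_cubes(child, depth - 1) for child in tree.children)

root = Octree((-1.0, -1.0, -1.0), (1.0, 1.0, 1.0))
leaves = count_cubes(root, 2)
```

Only 1 + 8 + 64 nodes are constructed here, although every node could be subdivided further. The consumer, not the generator, decides where the tree ends.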

The cubes in the octree are used as a basis to generate polygons describing
the surface defined by the function. Now we define a function f of the type

f :: Vectors -> Real

and a macro to check for its sign

F x :== f x > 0.0

We want to generate polygons, each of which we describe by a list of vectors.

:: Polygon :== [Vectors]

The consuming function takes a counter for the current depth, the defined
octree and a list of polygons that serves as continuation for the output.

consumeOctree :: Int Octree [Polygon] -> [Polygon]

The function consumeOctree distinguishes the following cases.

- If the minimum depth has been reached, and the function has the same sign
  at all corner points, which means the current cube is not interesting,
  nothing has to be drawn inside this cube. The current cube is dropped.

- If the maximum depth is reached, the current cube is cut into six tetrahedra,
  into which polygons are inscribed. This is done by the function tetra.

- Otherwise we apply the function consumeOctree to all subcubes of the
  current cube by continuation.

consumeOctree n t=:{cube=(Vectors west south back, Vectors east north front)} cont
    | n>depthMIN && allTheSame [F nwnf, F nenf, F nwsf, F nesf
                               ,F nwnb, F nenb, F nwsb, F nesb] = cont
    | n>=depthMAX =
        ( tetra nwsf nesf nwsb nwnf
        ( tetra nenf nwnf nwnb nwsb
        ( tetra nesf nenf nwsb nwnf
        ( tetra nesb nwsb nesf nenf
        ( tetra nenb nenf nwnb nesb
        ( tetra nwnb nenf nwsb nesb cont))))))
    #! n = n+1
    | otherwise =
        ( consumeOctree n t.wnf ( consumeOctree n t.enf
        ( consumeOctree n t.wsf ( consumeOctree n t.esf
        ( consumeOctree n t.wnb ( consumeOctree n t.enb
        ( consumeOctree n t.wsb ( consumeOctree n t.esb
          cont))))))))
where
    allTheSame xs = and xs || not (or xs)
    nwnf = Vectors west north front
    nenf = Vectors east north front
    nwsf = Vectors west south front
    nesf = Vectors east south front
    nwnb = Vectors west north back
    nenb = Vectors east north back
    nwsb = Vectors west south back
    nesb = Vectors east south back

The decision whether a polygon is inscribed into a tetrahedron (and if so,
which polygon) is made by a case distinction on the signs of the function at
the four corners of the tetrahedron.

If a polygon is inscribed into a tetrahedron, the nodes spanning this polygon
are the zeros of the implicit function on the corresponding edges of the
tetrahedron. The zeros of the function on these edges are approximated using a
simple bisection algorithm along the edge. For some visualization tools the
order in which the nodes of a polygon are connected matters, so we maintain a
right-handed system.

tetra :: Vectors Vectors Vectors Vectors [Polygon] -> [Polygon]
tetra v1 v2 v3 v4 cont
| p1 | p2 | p3 | p4 = cont
               = [[zero14, zero34, zero24] : cont]
          | p4 = [[zero13, zero23, zero34] : cont]
          = [[zero14, zero13, zero23, zero24] : cont]
     | p3 | p4 = [[zero12, zero24, zero23] : cont]
          = [[zero12, zero14, zero34, zero23] : cont]
     | p4 = [[zero12, zero24, zero34, zero13] : cont]
     = [[zero12, zero14, zero13] : cont]
| p2 | p3 | p4 = [[zero12, zero13, zero14] : cont]
          = [[zero12, zero13, zero34, zero24] : cont]
     | p4 = [[zero12, zero23, zero34, zero14] : cont]
     = [[zero12, zero23, zero24] : cont]
| p3 | p4 = [[zero14, zero24, zero23, zero13] : cont]
     = [[zero13, zero34, zero23] : cont]
| p4 = [[zero14, zero24, zero34] : cont]
= cont
where
    p1 = F v1
    p2 = F v2
    p3 = F v3
    p4 = F v4

    zero12 = bisection depthBS v1 v2
    zero13 = bisection depthBS v1 v3
    zero14 = bisection depthBS v1 v4
    zero23 = bisection depthBS v2 v3
    zero24 = bisection depthBS v2 v4
    zero34 = bisection depthBS v3 v4

bisection :: Int Vectors Vectors -> Vectors
bisection depth r l
    | depth == 0   = mid
    | F l == F mid = bisection (depth-1) mid r    // same sign at l and mid
    | otherwise    = bisection (depth-1) l mid
where
    mid = (l + r) /. 2.0



This is all we need to determine the polygons in interesting cubes. What
remains is the task of visualization.

We have chosen VRML [T] to draw the polygons, which stands for Virtual 
Reality Modelling Language. VRML has a number of advantages: It is a stan- 
dardized description language for three dimensional objects. There exist VRML 
browsers on many platforms; some are plug-ins for common web browsers. Some
VRML browsers let the user fly through the depicted three-dimensional object,
which makes it easy to explore. Normal vector information can be added to
get a smoother picture of the object, a feature that we will exploit in Section 6.

5 Efficiency Issues 

We will now discuss the time and space efficiency of the Clean implementation. 
The implementation above only uses a couple of kilobytes for execution and runs 
pretty fast. We begin with some fairly simple optimizations. Later on we discuss 
how once computed values of the implicit function can be reused. At the end of 
this section we will measure the effects of these optimizations. 



Simple Optimizations 

Some small changes are introduced to enhance the efficiency. The representation 
of the cube is changed to a record that contains all eight corners. 

:: Cube =
    { enf :: Real, wnf :: Real, esf :: Real, wsf :: Real
    , enb :: Real, wnb :: Real, esb :: Real, wsb :: Real
    }

This is convenient since all corners of the cube are eventually needed. The
generation of octrees is updated accordingly.

Some superfluous packing and unpacking of values in data types is prevented
by passing the three coordinates separately to the implicit function instead of
packed into a single vector. The type of the implicit function becomes:

f :: !Real !Real !Real -> Real

instead of:

f :: Vectors -> Real



Memoization

The implicit function is repeatedly calculated for a large number of points.
The point in the middle of a cube, for instance, is a corner point of its eight
subcubes. Moreover, it is a corner of one of the subcubes of each subcube, and
so on. It seems worthwhile to share this computation using some form of
memoization [5,12]. We could further try to share the approximated zeros on
neighboring edges of a tetrahedron, but we will not consider this here. Clean
has no built-in memoization mechanism, so sharing of computations has to be
indicated explicitly. There are two problems concerning sharing.

First, there is a very large number of potential points where this function
can be evaluated. For an octree of depth n, the potential number of points is
$2^n$ for each dimension. That means that the total number of possible
evaluations is $(2^n)^3$. Even for a moderate octree depth, say 4 or 5, storing
function values in an array of this size is not feasible. It simply consumes
too much space. Moreover, the value of the implicit function is computed only
for a small fraction of all possible points of an octree. The shape of the
implicit surface determines at which points the function has to be evaluated.

The second problem is that the arguments of the implicit function are of 
type Real. Real numbers however are only approximated on ordinary computer 
hardware, and therefore rounding errors will occur. We have to make sure that 
the coordinates of a shared point are always the same, regardless of which cube
refers to it.

Nevertheless it is possible to implement an elegant and easy to use memo- 
ization mechanism for this situation. There are two key ideas which we need to 
construct the memoization function: rational numbers and a tailor-made data 
structure. 

Rational Numbers 

First we use rational numbers instead of the corresponding real numbers.
Our rational numbers are based on integers and do not suffer from
rounding errors. All points inside the outermost cube are addressed by rational
numbers between zero and one. The maximum depth of evaluation of the octree
is depthMAX+depthBS+1. This implies that $2^{depthMAX+depthBS+1}$ can be used
as the (fixed) denominator of the rational numbers used. On standard hardware
depthMAX+depthBS is limited to 30 in this representation, which is certainly
sufficient for practical applications.

:: Rat :== Int
:: RatVec = Vectors !Rat !Rat !Rat

den :: Int
den =: 1 << (depthMAX+depthBS+1)    // bitshift instead of
                                    // 2 ^ (depthMAX+depthBS+1)
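The fixed-denominator representation can be illustrated in a few lines of Python. This is a sketch under stated assumptions: the constants use the sample values depthMAX = 4 and depthBS = 4, and the helper names `rat_to_real`, `to_coord` and `mid` are hypothetical.

```python
DEPTH_MAX, DEPTH_BS = 4, 4                  # sample values, not prescribed
DEN = 1 << (DEPTH_MAX + DEPTH_BS + 1)       # fixed denominator, a power of two

def rat_to_real(num: int) -> float:
    """Value of the rational number num / DEN."""
    return num / DEN

def to_coord(num: int) -> float:
    """Map a numerator in [0, DEN] to a coordinate in [-1, 1]."""
    return 2.0 * rat_to_real(num) - 1.0

def mid(a: int, b: int) -> int:
    """Midpoint of two numerators; exact as long as a + b is even."""
    return (a + b) >> 1

# The centre of the outermost cube and the centre of one of its subcubes,
# computed purely with integers.
centre = mid(0, DEN)
sub_centre = mid(DEN >> 2, DEN >> 1)
```

Because every point is identified by integer numerators, two cubes that share a corner refer to it by exactly the same key, which is what the memoization below relies on.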

We do not want to change the definition of the implicit function, which 
accepts a vector as argument. Hence we need to define a conversion from the 
rational numbers used inside the octree to a vector expected by the implicit 
function: 

RatToReal :: !Rat -> Real
RatToReal num = toReal num / toReal den

toRealx x :== 2.0 * RatToReal x - 1.0
toRealy y :== 2.0 * RatToReal y - 1.0
toRealz z :== 2.0 * RatToReal z - 1.0

The generation of polygons has to be changed slightly. The type Vectors in
the octree is replaced by RatVec and the function f is altered to:

F :: !Rat !Rat !Rat -> Real
F x y z = f (toRealx x) (toRealy y) (toRealz z)



Binary Trees to Implement Memoization 

The second idea for memoization is the fact that the implicit function is not 
evaluated at random points. The evaluation will follow the octree. The implicit 
function is not evaluated at all points of the octree, but all points where the 
function is evaluated are part of the octree. The data type used for the memo- 
ization reflects the octree approach. The outermost cube is treated specially. All 
inner points are stored in a potentially infinite binary tree.

In order to address the points in the cube uniquely we will use nested one- 
dimensional trees rather than a straight three dimensional tree. 

:: Memo t  = { zero :: t, btwn :: MemoT t, one :: t }
:: MemoT t = { smll :: MemoT t, half :: t, grtr :: MemoT t }

We use lazy evaluation again to construct only the necessary parts of these
trees. The following functions generate memo trees for a given function and
look up a needed value in such a tree.

genMemo :: (Rat -> t) -> Memo t
genMemo f = { zero = f 0
            , btwn = genMemoT f (den>>1) (den>>2)
            , one  = f den
            }

genMemoT :: (Rat -> t) Int Int -> MemoT t
genMemoT f n d
    #! d2 = d>>1                          // d2 = d/2
    = { smll = genMemoT f (n-d) d2
      , half = f n
      , grtr = genMemoT f (n+d) d2
      }

lookup :: !(Memo t) !Rat -> t
lookup memo num
    | num==0   = memo.zero
    | num==den = memo.one
    | num<den  = lookupT memo.btwn num (den>>1)

lookupT :: !(MemoT t) !Rat !Int -> t
lookupT memot num d
    | num<d  = lookupT memot.smll num (d>>1)
    | num==d = memot.half
             = lookupT memot.grtr (num-d) (d>>1)
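The memo tree can be mimicked in Python by making every field a lazily computed attribute. The following is a sketch for illustration, not a translation that claims the same performance; the class layout, the constant `DEN = 512` and the recording of calls are assumptions of this example.

```python
from functools import cached_property

DEN = 1 << 9   # assumed fixed denominator (a power of two)

class MemoT:
    """Lazy binary tree over the numerators strictly between 0 and DEN."""
    def __init__(self, f, n, d):
        self.f, self.n, self.d = f, n, d

    @cached_property
    def half(self):
        return self.f(self.n)                 # computed at most once

    @cached_property
    def smll(self):
        return MemoT(self.f, self.n - self.d, self.d >> 1)

    @cached_property
    def grtr(self):
        return MemoT(self.f, self.n + self.d, self.d >> 1)

class Memo:
    """Memo structure for a function on the numerators 0 .. DEN."""
    def __init__(self, f):
        self.f = f
        self.btwn = MemoT(f, DEN >> 1, DEN >> 2)

    @cached_property
    def zero(self):
        return self.f(0)

    @cached_property
    def one(self):
        return self.f(DEN)

def lookup(memo, num):
    """Find the memoized value for numerator num, building tree nodes on demand."""
    if num == 0:
        return memo.zero
    if num == DEN:
        return memo.one
    node, d = memo.btwn, DEN >> 1
    while num != d:
        if num < d:
            node = node.smll
        else:
            node, num = node.grtr, num - d
        d >>= 1
    return node.half

calls = []
squares = Memo(lambda n: calls.append(n) or n * n)
first = lookup(squares, 3)
second = lookup(squares, 3)    # served from the tree, no second call
```

A three-dimensional memo structure is obtained by nesting, for example `Memo(lambda x: Memo(lambda y: Memo(lambda z: g(x, y, z))))` for some function `g`, with three nested `lookup` calls to retrieve a value.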

We have to change the function F again to implement memoization. A shared 
local data structure of type Memo (Memo (Memo Bool)) is defined, which con- 
tains the required values. If a function value is needed, it is retrieved from this 
data structure. 

F :: !RatVec -> Bool
F (Vectors x y z) = lookup (lookup (lookup memo_f x) y) z

memo_f :: Memo (Memo (Memo Bool))
memo_f =: genMemo (\x -> genMemo (\y -> genMemo (\z ->
            isPos (figO (toRealx x) (toRealy y) (toRealz z)))))



Measurements 

In order to determine the effect of these optimizations we compare the execution 
time of four different versions of the program. The first version is the original 
implementation outlined in section four. The second version incorporates the 
simple optimizations of the first subsection. In the third version we have replaced 
the type Real by Rat to compute the corner points of the cubes in the octree. 
The final version uses the data structure Memo for the memoization of values of 
the implicit function. 

We compare the execution time for three different examples. The first example
is a blend of two spheres, as shown in Figures 1 and 7. The second example,
of medium complexity, is a blend of three tori and three cylinders, shown in
Figure 8. The third and most complicated example is a blend of 27 cylinders
and a sphere and is depicted in Figure 6.




Fig. 6. Complicated example: blend of 27 cylinders and a sphere, depth = 5 



Table 1 lists the run-time behavior of the Clean code. We chose depthMAX=4
and depthBS=4. The listed execution time includes writing the generated
polygons to a file, but excludes the generation of the VRML output. All
measurements were done on a 266MHz PC running Windows 95. The programs had
40MB of heap and 1MB of stack. The executable was generated by version 1.3.3
of the Clean Compiler.



Table 1. Run-time behavior of the Clean implementation: Execution time (ex), garbage
collection time (gc), total execution time (tot), all given in seconds

                      Original            Improved            Using Rat           Rat and Memo
Figure   Polygons   ex    gc    tot     ex    gc    tot     ex    gc    tot     ex    gc    tot
simple       2952   1.49  0.17  1.66    0.24  0.02  0.26    0.22  0.02  0.24    0.37  0.16  0.53
medium       5572   9.54  1.54  11.1    1.04  0.04  1.08    0.82  0.03  0.85    0.70  0.17  0.87
complex     12025   100   20.8  121     17.1  2.84  19.9    13.4  2.51  15.9    3.08  1.27  4.35



From these figures we conclude that the simple modifications that prevent
packing and unpacking Reals in a Vectors speed up the program by a factor of
six to ten. This is not surprising since the implicit function is evaluated
very often.

The introduction of rational numbers within the octree incurs some overhead,
i.e. the rational numbers must be transformed to Real before the implicit
function can be applied. Apparently this overhead is outweighed by the more
efficient handling of integers. The execution time decreases in spite of the
overhead.

The introduction of memoization exhibits a more subtle behavior. For the
simplest implicit function the execution time doubles. Apparently it is more
efficient to recompute such a simple function than to look up the function
values in the Memo data structure. For the medium example (three cylinders and
three tori) there is almost no difference between recomputing the function and
memoization. For the most complex implicit function memoization increases the
efficiency by almost a factor of four. In short, memoization costs some time
for simple implicit functions but reduces the execution time significantly for
complex ones. Hence, we consider the introduction of memoization an
improvement.

In order to get an impression of the absolute speed of our implementation we 
compare it with the public domain implementation of a related algorithm in C 
P]. The comparison gives only an indication of the relative speed since there are 
a number of significant differences. The C implementation requires a small cube 
near the surface of the implicit function as starting point. From this starting 
point a set of equal sized cubes containing the surface is generated. There are no 
octrees involved. For each of these cubes polygons are generated by dividing the 
cube into six tetrahedra. The C implementation generates two triangles instead 
of a quadrilateral. Moreover, the C implementation also reuses computed zeros 
on the edges shared by neighboring tetrahedra. For the medium example, the 
C program generates 8028 triangles in 0.7 seconds. Despite all differences this 
corresponds very well to the execution time of our Clean implementation (0.8 
seconds, see Table 1). We conclude that our program performs pretty well. 



6 Adding Normal Vector Information 

Polygonizing an implicit surface yields polygons that are spanned by nodes on
the surface. The edges of these polygons most likely do not lie on the surface,
as they just approximate it, which may result in artificially sharp features of
the surface.

Many graphics engines offer a remedy if the user provides normal vectors at
the nodes. While shading the surface, the engine uses the normal vectors to
interpolate the area around edges to give the visual impression of a smooth
surface. As mentioned earlier, we chose VRML as output format, as it also
supports normal vector information.

For implicit surfaces the normal vector at a given point on the surface is just
the gradient at this very point. The gradient is the column vector built from
all first partial derivatives. Let F define an implicit surface; then the
normal vector n at (x, y, z) is

$$ n(x,y,z) = \nabla F(x,y,z) = \begin{pmatrix} F_x(x,y,z) \\ F_y(x,y,z) \\ F_z(x,y,z) \end{pmatrix} $$

There are basically three ways to compute derivatives of a function:
numerically, symbolically, and automatically.

Numerical differentiation usually approximates the derivative using the
definition of the differential quotient with a step size in the denominator.
The smaller the step size, the more accurate the result becomes. However, a
step size that is chosen too small may lead to huge roundoff errors and
meaningless results.
The C implementation accompanying calculates normal vectors numerically. 

Symbolic differentiation usually yields a function, which can then be
evaluated at the points needed. However, symbolic differentiation can be a very
intricate task. Numerical and automatic differentiation only yield the
derivative at a single point, but are much easier to calculate.
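The trade-off in the choice of the step size for numerical differentiation can be observed directly. The Python sketch below is an illustration only; the function `forward_difference`, the test function `math.sin` and the two step sizes are choices made for this example.

```python
import math

def forward_difference(f, x, h):
    """Differential quotient of f at x with step size h."""
    return (f(x + h) - f(x)) / h

exact = math.cos(1.0)   # derivative of sin at x = 1

# A moderate step size gives a small truncation error.
err_moderate = abs(forward_difference(math.sin, 1.0, 1e-6) - exact)

# A tiny step size is dominated by roundoff in double precision.
err_tiny = abs(forward_difference(math.sin, 1.0, 1e-15) - exact)
```

With double precision arithmetic the tiny step size gives a far less accurate result than the moderate one, because the difference `f(x + h) - f(x)` then consists of only a few units in the last place.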

For the Clean implementation we have adopted automatic differentiation, 
which has already been employed successfully in a functional context |7|SJ . The 
method calculates the derivative at a given point at machine precision. For an
introduction to automatic differentiation we refer to Rall and Corliss [13],
for a comprehensive treatment to Griewank [3].

Automatic differentiation can be coded very elegantly in a pure functional 
language using operator overloading and lazy evaluation. 

In order to differentiate a given function automatically, we shift the
function from the real domain to a differential domain. This is done by
replacing each subexpression p by an infinite sequence that contains the
subexpression and all its derivatives. We refer to it as a differential
object: [p, p', p'', p''', ...]. For a constant subexpression c the
differential object consists almost entirely of zeros, [c, 0, 0, 0, ...], as
the derivative of a constant is zero.

We model the differential objects by the algebraic data type Diff.

:: Diff a = Zero | D a (Diff a)

If the tail of a differential object contains only zeros, we abbreviate it by 
using the Zero constructor. For instance a constant c is modeled by D c Zero. 
For a simple variable x, the first derivative is one and all subsequent derivatives 
are zero. The representation is therefore D x (D one Zero). 

All overloaded operators that appear in the implicit function are instantiated 
for differential objects. These instances contain all the necessary information of 
differential calculus. Addition and multiplication on differential objects become:

instance + (Diff a) | + a
where
    (+) Zero g                = g
    (+) f Zero                = f
    (+) (D x xs) (D y ys)     = D (x+y) (xs+ys)

instance * (Diff a) | *, + a
where
    (*) Zero _                      = Zero
    (*) _ Zero                      = Zero
    (*) f=:(D x xs) g=:(D y ys)     = D (x*y) (xs*g + f*ys)

As an example we evaluate the function h x = x * x at x = 3. The latter
is modelled by (D 3.0 (D 1.0 Zero)). The application of h to (D 3.0 (D 1.0
Zero)) yields D 9.0 (D 6.0 (D 2.0 Zero)), which corresponds to the value
of h and its first and second derivatives at x = 3.

Due to lazy evaluation we only calculate the derivatives really needed. For 
the first derivative we have to look no further than the second element of the 
resulting differential object. 

After the polygon generation we apply automatic differentiation to all the
generated nodes. The overhead of calculating normal vectors is thus
proportional to the number of nodes generated. We lift a node to the
differential domain and apply an instance Diff Real of the implicit function
to it. Finally, the required normal vector is extracted from the differential
object.
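The differential objects and their use for normal vectors can be sketched in Python with operator overloading. The sketch is eager rather than lazy and stores the derivatives in a list, with the empty list playing the role of Zero; the class layout, the sample function `sphere` and the helper `gradient` are assumptions of this illustration.

```python
class Diff:
    """A value followed by its derivatives; no terms at all stands for Zero."""
    def __init__(self, *terms):
        self.terms = list(terms)

    def __add__(self, other):
        if not self.terms:
            return other
        if not other.terms:
            return self
        tail = Diff(*self.terms[1:]) + Diff(*other.terms[1:])
        return Diff(self.terms[0] + other.terms[0], *tail.terms)

    def __mul__(self, other):
        if not self.terms or not other.terms:
            return Diff()
        f_tail, g_tail = Diff(*self.terms[1:]), Diff(*other.terms[1:])
        tail = f_tail * other + self * g_tail          # product rule
        return Diff(self.terms[0] * other.terms[0], *tail.terms)

def h(x):
    return x * x

h_at_3 = h(Diff(3.0, 1.0)).terms      # value, first and second derivative

def sphere(x, y, z):
    """Implicit function of the unit sphere on differential objects."""
    return x * x + y * y + z * z + Diff(-1.0)

def gradient(F, point):
    """Normal vector of the implicit surface F at the given point."""
    grad = []
    for i in range(3):
        # Lift the point: coordinate i is the variable, the others are constants.
        args = [Diff(c, 1.0) if j == i else Diff(c) for j, c in enumerate(point)]
        terms = F(*args).terms
        grad.append(terms[1] if len(terms) > 1 else 0.0)
    return grad

normal = gradient(sphere, (1.0, 0.0, 0.0))
```

One evaluation of the implicit function per coordinate direction is needed here, because each differential object tracks derivatives with respect to a single variable.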




Fig. 7. Simple example: blend of two spheres, depth = 5. Right hand side with normal 
vectors 



7 Related Work 

There are two papers directly related to our work: Karczmarczuk jS] advocates the
use and advantages of implicit surfaces in general and in a functional setting, 
however without octrees. 

O’Donnel m also gives a functional formulation of a traditionally imperative 
algorithm from computer graphics. A framework of the hierarchical radiosity 
algorithm (a two dimensional problem) is coded in Haskell, utilizing a forest of 
quadtrees of unbounded depth, while low level calculation is done in C. 

8 Conclusion 

Implicit functions are a convenient way to specify and manipulate surfaces in
computer graphics. The octree approach, which determines the set of polygons
that approximates the surface of an implicit function, can be implemented very
concisely in a lazy functional programming language.

Fig. 8. Medium example: blend of three tori and three cylinders, depth = 5. Right
hand side with normal vectors

This implementation relies on the use of infinite data structures, lazy
evaluation, overloading and the composition of program fragments. Due to the
succinct implementation of the octree approach the algorithm becomes easily
comprehensible and encourages optimization and extensions.

In this paper we have demonstrated this by introducing memoization to im- 
prove efficiency. Through this optimization the resulting functional implementa- 
tion is almost as efficient as related algorithms implemented in C. 

The second extension of the algorithm is automatic differentiation. Using au- 
tomatic differentiation we calculate normal vectors at the nodes of the polygons 
approximating the implicit surface. This information is used by visualization 
tools to let the surface appear much smoother. 

Our paper shows that a lazy functional programming language like Clean
is an outstanding tool for the implementation of this algorithm. Due to typical
properties of lazy functional programming languages, such as infinite data
structures, lazy evaluation, pattern matching and the composition of program
fragments, the implementation is clear and flexible. It appears to be a
suitable starting point for further research. The runtime penalty, which is
often used as an argument against functional programming languages, is very
low.



Abstract. In this paper we compare three systems for tracing and debugging
Haskell programs: Freja, Hat and Hood. We evaluate their usefulness in
practice by applying them to a number of moderately complex programs in which
errors had deliberately been introduced. We identify the strengths and
weaknesses of each system and then form ideas on how the systems can be
improved further.



1 Introduction 

The lack of tools for tracing and debugging has deterred software developers from 
using functional languages m- Conventional debuggers for imperative languages 
give the user access to otherwise invisible information about a computation by 
allowing the user to step through the program computation, stop at given points 
and examine variable contents. This tracing method is unsuitable for lazy func- 
tional languages, because their evaluation order is complex, function arguments 
are usually unwieldy large unevaluated expressions and generally computation 
details do not match the user’s high-level view of functions mapping values to 
values. 

In the mid-1980s a wave of research into tracing methods for lazy
functional languages started and has been growing since. In this paper we
compare the tracing systems that (a) cover a large subset of a standard lazy 
functional language, namely Haskell 98 [9], (b) are publicly available and (c) are 
still actively developed. Freja^1 |7I5| is a system that creates an evaluation
dependency tree as trace, a structure based on the idea of declarative/algorithmic
debugging from the logic programming community. Hat^2 [I12lllj creates a trace
that shows the relationships between the redexes (mostly function applications)
reduced by the computation. The most recent system, Hood^3 U], enables the
programmer to observe the data structures at given program points. It can basically
be used like print statements in imperative languages, but the lazy evaluation 
order is not affected and functions can be observed as well. 

^1 http://www.ida.liu.se/~henni
^2 http://www.cs.york.ac.uk/fp/ART
^3 http://www.haskell.org/hood

In this paper we compare Freja 1.1, Hat 1.0 and the July 2000 release of Hood. We
evaluate the systems in practice by applying them to a number of moderately 
complex programs in which errors are deliberately introduced. Tracing systems 
are interactively used tools. In this paper we concentrate on the usefulness of 
the systems for the programmer. Runtime and space usage measurements are 
reported in other papers i5i6im . We do not aim for a quantitative comparison 
to crown a winner. Only with a large number of programmers could we have 
obtained statistically valid data about, for example, how long it takes to locate 
a specific error with a specific system. Even these data depend for example 
on how well the programmers are trained for a system, especially because the 
systems are rather different. Our aim is to explore the design space of tracers 
and gain insights for the future development of tracing and debugging systems. 
Our experiments highlight and sometimes even uncover previously unnoticed 
similarities and distinguishing features of the three systems. The experiments 
enable us to evaluate the usefulness of system features and lead us to new ideas 
for how the current systems can be improved or even be combined. 

The paper is structured as follows. Section 2 gives a short introduction to
each of the three systems. Section 3 compares the systems with respect to their
approach to tracing, design and implementation. Section 4 reports on our
practical experiments and the insights they gave us into the systems'
distinguishing properties and their usefulness. Section 5 briefly describes
other systems for tracing and debugging. Section 6 concludes.

2 Learn Three Systems in Three Minutes 

To give an idea about what the three tracing systems provide and how they are 
used we give a short introduction here. Because all three systems are still under 
rapid development we try to avoid details that may change soon. 

We demonstrate the use of each system with the following example program^4.

main = let xs = [4*2, 3+6] :: [Int]
       in (head xs, last xs)

head (x:xs) = x

last (x:xs) = last xs
last [x]    = x

Note that the evaluation in Section 4 is based on experiments with far larger
programs.

^4 Freja actually expects main to be of type String and the other two systems expect
it to be of type IO (). Here we abstract from the details of input/output.



2.1 Freja
Freja is a compiler for a subset of Haskell 98. A debugging session consists of the 
user answering a sequence of questions. Each question concerns a reduction of a 
redex - that is, a function application - to a value. The user answers yes
if the reduction is correct with respect to the user's intentions, and no otherwise. In
the end the debugger states which reduction is the cause of the observed faulty 
behaviour - that is, which function definition is incorrect. 

The first question always asks if the reduction of the function main to the 
result value of the program is correct. If the question about the reduction of 
a function application is answered with no, then the next question concerns a 
reduction for evaluating the right-hand-side of the definition of this function. 
Freja can be used rather similarly to a conventional debugger. The input no 
means “step into current function call” and the input yes means “go on to 
next function call” . If the reduction of a function application is incorrect but all 
reductions for the evaluation of the function’s right-hand-side are correct, then 
the definition of this function must be incorrect for the given arguments. 

The following is a debugging session with Freja for our example program. 
The symbol ⊥ represents an error and the symbol ? represents an expression 
that has never been evaluated and whose value hence cannot have influenced 
the computation. 

main => (8, ⊥)       no 
4*2 => 8             yes 
head [8,?] => 8      yes 
last [8,?] => ⊥      no 
last [?] => ⊥        no 
last [] => ⊥         yes 
Bug located! Erroneous reduction: last [?] => ⊥ 
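
The question-and-answer procedure described above can be sketched in a few lines of Haskell. This is only an illustration under simplifying assumptions, not Freja's implementation: the trace is an explicit tree of reduction strings, the user is modelled as a pure oracle function, and the names EDT, Oracle, locate and debug are ours. The symbol ⊥ is written _|_ in the strings.

```haskell
-- A node records one reduction (as text) together with the reductions
-- performed to evaluate the right-hand side of the applied function.
data EDT = Node String [EDT]

-- The user is modelled as an oracle: True means "yes, this reduction is correct".
type Oracle = String -> Bool

-- Search below a node that was already judged wrong: descend into the
-- first wrong child; if every child is correct, the node itself is the bug.
locate :: Oracle -> EDT -> String
locate ok (Node r kids) =
  case filter (\(Node k _) -> not (ok k)) kids of
    (bad : _) -> locate ok bad
    []        -> r

-- Start with the question about the root; Nothing means the result is fine.
debug :: Oracle -> EDT -> Maybe String
debug ok t@(Node r _)
  | ok r      = Nothing
  | otherwise = Just (locate ok t)

-- The tree for the example program, transcribed from the session above.
exampleEDT :: EDT
exampleEDT =
  Node "main => (8, _|_)"
    [ Node "4*2 => 8" []
    , Node "head [8,?] => 8" []
    , Node "last [8,?] => _|_"
        [ Node "last [?] => _|_"
            [ Node "last [] => _|_" [] ]
        ]
    ]

-- The answers given in the session: only these reductions are confirmed.
exampleOracle :: Oracle
exampleOracle r =
  r `elem` ["4*2 => 8", "head [8,?] => 8", "last [] => _|_"]
```

With these definitions, debug exampleOracle exampleEDT yields Just "last [?] => _|_", the reduction reported at the end of the session.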

2.2 Hat 

Hat consists of a modified version of the nhc98 Haskell compiler³ and a separate 
browser program. A program compiled for tracing executes as usual except that 
alongside the normal computation it builds a redex trail in heap and instead of 
terminating at the end it waits for the browser to connect to it. The browser 
shows the output of the program. The user selects a part of it and asks the 
browser for its parent redex. The parent redex of an expression is the redex that 
through its own reduction created the expression. Each part of the redex has 
again a parent redex which the browser shows on demand. A trail ends at the 
function (redex) main, which has no parent. Debugging with Hat works by going 
from a faulty output or error message backwards until the error is located. 

³ http://www.cs.york.ac.uk/fp/nhc98 

The browser has a graphical user interface which we do not discuss here. 
Basically the system is used as follows to locate the error in our example program. 
The program aborts with an error message and the browser directly shows its 
parent redex: last []. The user is surprised that the function last is ever 
called with an empty list as argument and asks the browser for the parent redex 
of last []. The answer, last (3+6 : []), makes clear that the definition of last 
is not correct for a single element list. The browser presents the redex trail as 
shown in the following figure. To demonstrate how the parent of a subexpression 
is presented (4*2 is the parent of 8), more of the redex trail is shown than is 
needed for locating the error. 

last [] 
last (3+6 : []) 
last (8 : 3+6 : []) 
      4*2 
main 

The browser can also show where in the program text a redex was created, for 
example that last is called with the argument [] in the equation for last (x:xs). 
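
The backward navigation described above can be pictured as following parent links. The sketch below is a deliberately reduced model and not Hat's data structure: it records only one parent per redex and ignores the parents of subexpressions and sharing. The type Trail and the function ancestry are hypothetical names chosen for illustration.

```haskell
-- A trail node: the expression as text and a link to its parent redex.
-- main has no parent, hence Nothing.
data Trail = Trail { expr :: String, parent :: Maybe Trail }

-- Follow the parent links from a node back to the root of the trail.
ancestry :: Trail -> [String]
ancestry t = expr t : maybe [] ancestry (parent t)

-- The chain of redexes from the example, starting at main.
mainR, last1, last2, last3 :: Trail
mainR = Trail "main" Nothing
last1 = Trail "last (8 : 3+6 : [])" (Just mainR)
last2 = Trail "last (3+6 : [])" (Just last1)
last3 = Trail "last []" (Just last2)
```

Evaluating ancestry last3 lists last [], last (3+6 : []), last (8 : 3+6 : []) and main in that order, which is the sequence of parents the user steps through in the browser.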

2.3 Hood 

Hood currently is simply a Haskell library. A user annotates some expressions 
in a program with the combinator observe, which is defined in the library. 
While the program is running, information about the values of the annotated 
expressions is recorded. After program termination the user can view for each 
annotation the observed values. 

We annotate the argument of last in our example program: 

main = let xs = [4*2, 3+6] 
       in (head xs, last (observe "last arg" xs)) 

When the modified program terminates it gives us the following information: 

-- last arg 
  _ : _ : [] 

The symbol _ represents an unevaluated expression. Note that the first element 
of the list xs is evaluated by the program, but not by the function last. 

To gain more insight into how the program works we observe the function 
last, including all its recursive calls: 

last = observe "last" last' 

last' (x:xs) = last xs 
last' [x]    = x 

The value of the function is shown as a finite mapping of arguments to results: 

-- last 
  { \ (_:_:[]) -> throw <Exception> 
  , \ (_:[])   -> throw <Exception> 
  , \ []       -> throw <Exception> 
  } 
So last is called with an empty list. We draw the conclusion that last 
applied to the one element list caused this erroneous call, but strictly the 
information provided by Hood does not imply this. 



3 Comparison in Principle 

At first sight the three systems do not seem to have anything in common ex- 
cept the goal of aiding debugging. However, all three systems take a two phase 
approach: while the program is running, information about the computation 
process is collected. After termination of the program the collected information 
is viewed in some kind of browser. In Freja, the browser is the part that asks 
the questions, in Hat the program that lets the user view parents and in Hood 
the part that prints the observations. This approach should not be confused 
with classical post-mortem debugging where only the final state of the computa- 
tion can be viewed. Having a trace that describes aspects of a full computation 
enables new forms of exploring program behaviour and locating errors which 
should make these systems also interesting for strict functional languages or 
even non-functional languages. 

All three systems are suitable for programs that show any of the three kinds 
of possible faulty observable behaviour: wrong output, abortion with error mes- 
sage, non-termination. In the latter case the program can be interrupted and 
subsequently the trace can be viewed. 



3.1 Values and Evaluation 

All three systems are source-level tracers. They mostly show Haskell-like ex- 
pressions which are built from functions, data constructors and constants of the 
program. To improve comprehensibility, all three systems show values instead of 
arbitrary expressions as far as possible. Hood only shows values anyway. Both 
Freja and Hat show an argument in a redex not as it was passed in the actual 
computation but as a value. Only (a part of) an argument that was never eval- 
uated is shown as an unevaluated redex in Hat (3+6 in the previous example) 
whereas Freja and Hood represent it by a special symbol (? in Freja and _ in 
Hood). Freja and Hat show an expression only up to a given depth (for example 
map succ (0 : succ 0 : □) in Hat; □ represents the elided subexpression). A 
subexpression beyond that depth is only shown on demand. None of the systems 
changes the usual observable behaviour of a program. In particular, they do not 
force the evaluation of expressions that are not needed by the program. 

However, the systems differ in that Hood shows values as far evaluated as 
they are demanded in the context of the observation position whereas both Freja 
and Hat show how far values are evaluated in the whole computation, including 
the effect of sharing. Hence in the previous example Freja and Hat show the first 
element of the list argument in the first call of last as 8 whereas Hood only 
represents that element by _. 

main => (8, ⊥) 
├── 4*2 => 8 
├── head [8,?] => 8 
└── last [8,?] => ⊥ 
    └── last [?] => ⊥ 
        └── last [] => ⊥ 

Fig. 1. Evaluation dependency tree 

[Figure: graph of the redex trail for the example program] 

Fig. 2. Redex trail 



3.2 Trace Structures 

In Hood a trace is a set of observations. These observations are shown in full to 
the user. In contrast, Freja and Hat each create a single large trace structure 
for a program run. It is impossible to show such a trace in full to the user. The 
browser of each system permits the programmer to walk through the structure, 
always seeing only a small local part of the whole trace. 

Freja creates an Evaluation Dependency Tree (EDT) as trace. Each node 
of the tree is a reduction as shown in the browser. The tree is basically the 
derivation/proof tree for a call-by-value reduction with miraculous stops where 
expressions are not needed for the result. The call-by-value structure ensures 
that the tree structure reflects the program structure and that arguments are 
maximally evaluated. Figure 1 shows the EDT for our example program of Section 2. 
The symbol ⊥ represents the value of the error message. 

Hat creates a redex trail as trace. A redex trail is a directed graph of value 
nodes and redex nodes. Each node, except the node for main, has an arrow to 
its parent redex node. Because subexpressions of a redex may have different 
parents or may be shared, redex nodes may contain arrows to nodes of their 
subexpressions. Figure 2 shows the redex trail for our example program of Section 2. 
Dotted arrows point to subexpressions. Both dashed and solid arrows 
denote the parent relationship. (8,⊥) is the result value of the computation. As 
in Freja, ⊥ represents the value of the error message. 

The graphs of the two trace structures are laid out to stress their similarity. 
All arrows of the EDT are also present in the redex trail but point in the opposite 
direction. If the redex trail held information about which parent relations 
correspond to reductions (these are shown as solid arrows), then the EDT could 
be constructed from the redex trail (however, see also the next paragraph and 
Section 4.1 about free variables). In contrast, the redex trail contains more 
information than the EDT, because it additionally links every value with its parent 
redex and describes how expressions are shared. 

The redex trail shown in Figure 2 is a simplified version of the one that is 
really created by Hat. The real redex trail has an additional node xs with parent 
main and children 4*2, 3+6, the two • : • nodes and []. That is, the redex trail 
also records the reduction of the let expression. The whole let expression is a 
redex, but in the redex trail it is represented by the defined variable xs. Similarly 
a node xs => [8,?] that records the reduction of the let expression could be 
added to the EDT. So recording a let reduction is an option for both the 
EDT and the redex trail and the implementors of Freja and Hat made different 
decisions with respect to this option. On the one hand recording let reductions 
leads to larger traces with an unusual kind of redex. On the other hand it enables 
more fine grained tracing (cf. Section 4.3). 

Because Hood observations contain values as they are demanded in a given 
context, whereas both the EDT and the redex trail contain values in their most 
evaluated form, it is not possible to gain Hood observations from either the EDT 
or the redex trail. Conversely, even observing every subexpression of a program 
with Hood would not enable us to construct an EDT or redex trail, because 
there is no information about the relations between the observations. 



3.3 Implementation and Portability 

Each system consists of two parts, the browser and a part for the generation of 
the trace. We will discuss the browsers in Section 4. 

The developers of the three systems made different choices about the level at 
which they implemented the creation of the trace. In Freja the trace is created in 
the heap directly by modified instructions of the abstract graph reduction ma- 
chine. Hat transforms the original Haskell program into another Haskell program. 
Running the compiled transformed program yields the redex trail in addition to 
the normal result. Finally, in Hood the trace is created as a side effect by the 
combinator observe, which is defined in a Haskell library. 

The level of implementation has direct effects on the portability to different 
Haskell systems. Hood can be used with different Haskell systems, because the 
library only requires a few non-standard functions such as unsafePerformIO 
which are provided by every Haskell system⁴. The transformation of Hat is currently 
integrated into the nhc98 compiler but could be separated. A transformed 
program uses a few non-standard unsafe functions to improve performance. Fur- 
thermore, some extensions of the Haskell run-time system are required to retain 
access to the result after termination or interruption and to connect to the 
browser. Finally, Freja is a Haskell system of its own. Adding its low-level trace 
creation mechanism to any other Haskell system would require a major rewriting 
of this system. 
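
To illustrate how a library-level tracer can record a trace as a side effect without changing what the program demands, here is a minimal sketch built on unsafePerformIO. It is not Hood's observe and handles lists only; observeSpine and demandedBy are hypothetical names, and the log format is our own. Each list constructor is logged at the moment the consumer demands it, and list elements are never forced by the wrapper.

```haskell
import Control.Exception (evaluate)
import Data.IORef (IORef, modifyIORef, newIORef, readIORef)
import System.IO.Unsafe (unsafePerformIO)

-- Wrap a list so that every constructor is logged when it is demanded.
-- The tail is wrapped lazily, so undemanded parts leave no log entry.
observeSpine :: IORef [String] -> [a] -> [a]
observeSpine logRef xs = unsafePerformIO $
  case xs of
    [] -> do
      modifyIORef logRef ("[]" :)
      return []
    y : ys -> do
      modifyIORef logRef ("_ :" :)
      return (y : observeSpine logRef ys)
{-# NOINLINE observeSpine #-}

-- Run a consumer on an observed list and report, in order of demand,
-- which constructors the consumer looked at.
demandedBy :: ([Int] -> b) -> [Int] -> IO [String]
demandedBy consumer xs = do
  logRef <- newIORef []
  _ <- evaluate (consumer (observeSpine logRef xs))
  reverse <$> readIORef logRef
```

For instance, demandedBy null [4*2, 3+6] records only the first constructor, whereas demandedBy length [undefined, undefined] records the whole spine without ever touching the undefined elements.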

3.4 Reduction of Trace Size 

In Hood the trace consists only of the observations of annotated expressions. 
Hence its size can be controlled by the choice of annotations⁵. In contrast, both 
Freja and Hat construct traces of the complete computation in the heap. 

To reduce the size of the trace, both Freja and Hat enable marking of functions 
or whole modules as trusted. The reduction of a trusted function itself is 
recorded in the trace, but not the reductions performed to evaluate the 
right-hand-side of its definition. The details of the trusting mechanisms of both 
systems are non-trivial, because the evaluation of untrusted functions which are 
passed to trusted higher-order functions has to be recorded in the trace. Usually 
at least the Haskell Prelude is trusted. 
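
The effect of trusting can be approximated on an explicit tree. The following sketch is our own simplification and not the mechanism of either system: it prunes an already built tree, whereas Freja and Hat avoid recording the reductions in the first place. The names EDT, prune and size are assumptions made for illustration. Reductions of trusted functions below a trusted node are dropped, while reductions of untrusted functions found there are kept, as happens when a user function is passed to a trusted higher-order function.

```haskell
-- A reduction node labelled with the name of the reduced function.
data EDT = Node { fun :: String, reduction :: String, children :: [EDT] }
  deriving (Eq, Show)

-- Remove the internal reductions of trusted functions, keeping any
-- untrusted reductions that occur beneath them.
prune :: [String] -> EDT -> EDT
prune trusted (Node f r kids)
  | f `elem` trusted = Node f r (concatMap keepUntrusted kids)
  | otherwise        = Node f r (map (prune trusted) kids)
  where
    keepUntrusted k
      | fun k `elem` trusted = concatMap keepUntrusted (children k)
      | otherwise            = [prune trusted k]

-- Number of reductions in a tree.
size :: EDT -> Int
size (Node _ _ kids) = 1 + sum (map size kids)

-- A made-up trace: main applies the trusted map to the untrusted inc.
example :: EDT
example =
  Node "main" "main => [2,3]"
    [ Node "map" "map inc [1,2] => [2,3]"
        [ Node "inc" "inc 1 => 2" []
        , Node "map" "map inc [2] => [3]"
            [ Node "inc" "inc 2 => 3" []
            , Node "map" "map inc [] => []" []
            ]
        ]
    ]
```

Here prune ["map"] example keeps the outer map reduction and both inc reductions but drops the two recursive map reductions, so the tree shrinks from six nodes to four.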

To further reduce the space consumption, both Freja and Hat support the 
construction of partial traces. In Freja, first only an upper part of the EDT may 
be constructed during program execution. When the user reaches the edge of 
the constructed part of the EDT in the browser, this part is deleted and the 
whole program is re-executed, this time constructing the part of the EDT that 
can be reached next by the questions. So, except for the time delay caused by 
re-execution, the user has the impression that the whole EDT is present. 

Hat can produce partial traces by limiting the length of the redex trails. Be- 
cause a redex trail is browsed backwards, the system prunes away those redexes 
that are further than a certain length away from the live program data or out- 
put. Hat does not provide any mechanism like re-execution in Freja to recreate 
a pruned part of the redex trail. 

⁴ The version of Hood which can handle not only terminating programs but also those 
that abort with an error message or do not terminate requires the non-standard 
exception library supplied with the Glasgow Haskell compiler. 

⁵ A variant of Hood allows the annotated running program to write observed events 
directly to a file, so that the trace does not need to be kept in primary memory. 
However, to obtain observations, the events in the file need to be sorted. Hence the 
browser for displaying observations reads the complete file and thus has problems 
with large observations. 






Olaf Chitil, Colin Runciman, and Malcolm Wallace 



Requiring less heap space may reduce garbage collection time, but Hat still 
spends the time for constructing the whole trace whereas Freja does not need to 
spend time on trace construction after construction of an upper part of an EDT. 



4 Evaluation of the Systems 

Differences between the systems directly raise several questions. Is it desirable 
to add a feature of one system to another system? Does an alternative design 
decision make sense? How far is a distinguishing feature inherent to a system, 
possibly determined by its implementation method or its tracing model? Because 
the design space for a tracer is huge, it is sensible to evaluate system features 
in practice early. We applied the three systems to a number of programs in 
which errors had deliberately been introduced. The errors caused all three kinds 
of faulty observable behaviour mentioned earlier: wrong output, abortion with 
error message and non-termination. 

Our evaluation experiments use the following protocol: At least two program- 
mers are involved. First the author of a correctly working program explains how 
the program works. Then one programmer secretly introduces several deliberate 
errors into the program, of a kind undetected by the compiler. Given the faulty 
program, the other programmers use a tracing system to locate and fix all the 
errors, thinking aloud and taking notes as they do so. 

All the participants are experienced Haskell programmers. 

The programs used in the experiments are of moderate complexity. The 
largest program, PsaCompiler, a compiler for a toy language, consists of 900 
lines in 13 modules and performs 20,000 reductions for the input we provided. 
The longest running program, Adjoxo, an adjudicator for noughts and crosses 
(tic tac toe), consists of only 100 lines but performs up to 830,000 reductions 
for our inputs. In our choice of programs we were restricted by the subset of 
Haskell that Freja supports. For example, Freja does not implement classes and 
unfortunately not even every Freja program is a valid Haskell program. Freja 
had been applied to a mini compiler with 16 million reductions [6] and Hat had 
been applied to a version of nhc98 with 14,000 lines and 5.2 million reductions 
and a chess end-game program with 20 million reductions El . These papers give 
performance figures but do not indicate how easy debugging programs of this 
size is. We cannot make such statements either, but our programs are definitely 
beyond toy examples and of a size often occurring in practice. Our programs also 
do not perform monadic input/output. Freja does not implement it and Hat only 
supports a few operations. It would be interesting to see if Hood’s ability to show 
the return value of an executed input/output action is sufficient in practice. 

4.1 Readability of Expressions 

In contrast to our preliminary fears that the expressions shown by the browsers 
- reductions, redexes and values - would be too large to be comprehensible, for 
our programs they are mostly of moderate size and easily readable. 
As we will discuss in Section 4.2, the user of a tracing system not only views 
the trace but also the program. Nonetheless, in Freja and Hat informative variable 
(function) names that convey the semantics of the variable well substantially 
reduce the need for viewing the program and thus speed up the 
debugging process. 



Unevaluated Expressions. Freja shows unevaluated expressions as ? and the 
undefined value as ⊥. This property makes expressions even shorter and more 
readable. This also holds for Hood. Only in some cases more information would 
be desirable for better orientation. In Hat the display of the unevaluated redexes 
in full sometimes obscures higher level properties, for example the length of a 
list. All in all our observations suggest that unevaluated expressions should be 
collapsed into a symbol by default but should be viewable on demand. 

Hood shows even less of a value than Freja, because it only shows the part 
demanded in a given context. Note that this amount of information would suffice 
for answering the questions of Freja. Because Hat is not based on questions, it is 
less clear if showing only demanded values would be suitable for it. Finally, the 
fact that Freja and Hat show values to the extent to which they are evaluated in 
the whole computation whereas Hood shows them to the extent to which they 
are demanded is closely linked to the respective implementations of the systems 
and thus not easily changeable. 



Functions. In Haskell, functions are first-class citizens and hence function val- 
ues may appear for example as arguments in redexes or inside data structures. 

For the representation of function values, Hood deviates from the principle 
of showing Haskell-like expressions. It shows function values as finite mappings 
from arguments to results. Because the mapping contains only expressions that 
were demanded during the computation, the representation is short in most 
cases. However, for functions that are called often and especially for higher- 
order functions the representation is unwieldy. The representation requires some 
time to get used to. In return, it permits a rather abstract, denotational view of 
program semantics which is useful for determining the correctness of part of a 
program. 
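
The idea of presenting a function as a finite mapping can be imitated with a small wrapper that records argument and result pairs as they are demanded. This is a sketch under our own assumptions, not Hood's representation: recordCalls and mappingOf are hypothetical names, the wrapper is restricted to Int in the demo, and, unlike Hood, printing the recorded pairs would force them fully.

```haskell
import Control.Exception (evaluate)
import Data.IORef (IORef, modifyIORef, newIORef, readIORef)
import Data.List (sort)
import System.IO.Unsafe (unsafePerformIO)

-- Record an (argument, result) pair each time an application of f is
-- demanded. The pair is stored unevaluated.
recordCalls :: IORef [(a, b)] -> (a -> b) -> a -> b
recordCalls ref f x = unsafePerformIO $ do
  let y = f x
  modifyIORef ref ((x, y) :)
  return y
{-# NOINLINE recordCalls #-}

-- Map the recorded function over some arguments, let a consumer demand
-- part of the result, and return the finite mapping that was observed.
mappingOf :: ([Int] -> Int) -> (Int -> Int) -> [Int] -> IO [(Int, Int)]
mappingOf consumer f args = do
  ref <- newIORef []
  _ <- evaluate (consumer (map (recordCalls ref f) args))
  sort <$> readIORef ref
```

The mapping contains only what the context demanded: mappingOf sum succ [1,2,3] records all three applications, while mappingOf head succ [1,2,3] records only the application to 1.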

In Freja and Hat a function value is shown as a function name, a λ-abstraction, 
or as a partial application of a function name or a λ-abstraction. Function names 
and their partial applications are easily readable but λ-abstractions are not. Neither 
system shows a λ-abstraction as it is written in the program; each represents 
it by a new symbol: <lambda#n> for a number n in Freja and (\) in Hat. 
Both systems can show the full λ-abstraction on demand. However, because 
of the necessary additional step and because λ-abstractions are often large 
expressions, reading expressions involving λ-abstractions is hard. We conjecture 
that with Freja or Hat debugging programs that make substantial use of 
λ-abstractions, as commonly done for stylised abstractions such as continuation 
passing, higher-order combinators and monads, is rather difficult. Our programs 
hardly use stylised abstractions. In fact, PsaCompiler uses only named functions, 
even in the definitions of its parser combinators, where most Haskell programmers 
would use λ-abstractions. During tracing, Freja and Hat show very readable 
expressions for PsaCompiler. 



Free Variables. Both λ-abstractions and the definition bodies of locally defined 
functions often contain free variables. To answer a question in Freja the values of 
such free variables must be known. Hence Freja shows this information in a where 
clause. The following question from an evaluation experiment demonstrates that 
this information usually adds to the comprehensibility of a question considerably: 

tableRead 
  "y" 
  (TableImp 
    (newTableFunction 
      where 
        newIndex = "x", 
        newEntry = 1, 
        oldTableFunction = implTableEmpty)) 
=> 
Just 1 

The correct answer is obviously no. 

Hat does not show the values of free variables. This information can be ob- 
tained only indirectly by following the chain of parent redexes of such a function. 
To realise that a function has free variables and to see the corresponding argu- 
ments of parent redexes it is necessary to follow links to the program source. 

In Hood an observation of a locally defined function can be misleading. The 
observation is really for a family of different functions, with different values for 
free variables. In our experiments one observation of a local function moveval is 
presented as follows: 

-- moveval 
  { ... , \ 8 -> Draw, ... } 
  { ... , \ 8 -> Win, ... } 

4.2 Locating an Error 

With all three systems we successfully locate all errors in our programs. For 
locating an error in our largest program we answer between 10 and 30 questions 
in Freja, look at 0 to 6 parents in Hat and add observe up to 3 times for 
Hood. The relation between these numbers is typical. However, the numbers 
cannot be compared directly to determine speed of use, because the counted 
operations are completely different. A major difference between the systems 
is the time the user has to spend thinking about what to do next, and the 
effort required to do it. For example, the time required in Hood for deciding 
where to add observe annotations, modifying the program (discussed further in 
Section 4.4), recompiling the program and reexecuting it is substantially higher 
than answering a question or selecting an expression for viewing its parent. 
Furthermore, the amount of data produced by a single observe annotation is 
usually substantial. 



Guidance and Strategies. Freja asks questions which the user has to answer 
whereas in both other systems the user also has to ask the right questions. Freja 
guides the user towards the error. 

Hat at least starts with the program output, an error message or the last 
evaluated redex in an interrupted program and the main operation is to choose 
a subexpression and ask for its parent. There are usually many subexpressions 
to choose from and the system never states that an error has been located at a 
given position in the program. Wrong parts in the output or wrong arguments in 
redexes are candidates for further enquiry. Nonetheless, for the less experienced 
user it is easy to get lost examining an irrelevant region of the redex trail. 

Hood gives the complete freedom to observe any value in the program. The 
initial choice of what to observe is difficult and often seems arbitrary. In general 
Hood users apply a top-down strategy in their placement of observe combina- 
tors, if the faulty behaviour does not point to any program location, for example 
when the program does not terminate. Then the questions the Hood user asks 
are similar to those asked by Freja. If, on the other hand, the position where 
the observable fault is caused can be identified, for example when the program 
aborts with an error message occurring only once in the program, then a Hood 
user tries to apply a bottom-up strategy reminiscent of Hat. 

Our programs contain several errors. Users of Hat and Hood locate the er- 
rors in the same order, because they always locate the error that causes the 
observed faulty behaviour. In contrast, the questions of Freja sometimes lead to 
the location of a different error. It is possible to tackle a specific faulty behaviour 
by answering some questions incorrectly, but that requires care. One may easily 
steer into irrelevant regions of the EDT. 



General Usability. Hat with its complex browser has the steepest learning 
curve for a new user. In contrast, the principle of questions and answers of 
Freja is easy to grasp and Hood has the advantage of using the idea of print 
statements, which are well-known from imperative languages. Hence a mode that 
would hide some features from the beginner seems desirable for Hat. 



Information Used. A Hood user has to modify the program and hence look 
at it. Sometimes just the process of searching for a good placement of observe 
reveals the error. Users of Freja and Hat, especially the former, tend to neglect 
the program. As long as the user knows the intended meaning of functions he 
can use Freja without ever looking at the program. This does however imply 
that the user does not try to follow Freja’s reasoning and to understand how the 
finally located error actually caused the observed faulty behaviour. Redexes as 
shown by Hat are not intended to be the only source of information for locating 
an error. Viewing the program part where a redex is created gives valuable 
context information and at the end the program is needed to locate the error. 
Both Freja and Hat provide quick access to the part of the program relating 
to the current question or redex. Nonetheless, it seems worthwhile to test if 
automatically showing the relevant part of the program when a new question or 
parent is shown would improve usability. 

In contrast to the other two systems Hat also gives information about which 
expressions are shared. This information is useful in some cases, usually when 
expressions are shared unexpectedly. 

A trace of Hood is a set of observations. The trace unfortunately contains no 
information about the relations between these observations. Hence, with a few 
exceptions, we observe functions to obtain at least a relation between arguments 
and result. In particular, the representation of an observed function shows clearly 
which (part of an) argument is not demanded by the function for determining 
its result. This feature is helpful for locating errors. 



Wrong Subexpressions. Often, in the questions posed by Freja, a specific 
subexpression of a result is wrong. For example in the following program the 1 
in the second list element should be a 2. But there is no way to give Freja this 
information. We can only confirm or refute the reduction as a whole. 

translateStatement 
  (TableImp 
    (newTableFunction 
      where 
        newIndex = "y", 
        newEntry = 2, 
        oldTableFunction = newTableFunction 
          where 
            newIndex = "x", 
            newEntry = 1, 
            oldTableFunction = implTableEmpty)) 
  7 
  (Assignment "x" (Compound (Var "x") Minus (Var "y"))) 
=> 
_Tuple_2 [Lod 1, Lod 1, Sb, Sto 1] 4 

In contrast, the redex trail contains the parent of every subexpression. A 
Hat user seldom asks for the parent of a complete expression but usually for the 
parent of some subexpression. We believe that this is the major reason why we 
look at far fewer parents with Hat than we answer questions of Freja for locating 
the same error. A Hood user obviously also tries to use information about wrong 
subexpressions but it is not easy to decide where to place the next observe 
combinator. 
Reduction of Information. In Hood, the user determines the size of the trace 
by the placement of observe combinators. It is, however, sometimes not easy to 
foresee how large an observation will be. The trusting mechanisms in Freja and 
Hat not only save space but also reduce the amount of information presented to 
the user. The ability of the Freja browser to dynamically trust a function and 
thus avoid further questions about it is useful. For Hat a corresponding feature 
seems desirable. In Freja, sometimes a question is repeated, because the same 
reduction is performed again. Hence memoisation of questions and their answers 
is desirable. It would also be useful to be able to generalise an answer, to avoid 
a series of very similar questions all requiring the same answer. 
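
The suggested memoisation of questions and answers is easy to picture. The sketch below is not a feature of Freja; it assumes questions are compared as plain strings and that the user is an IO action returning a Bool. memoOracle and demo are hypothetical names.

```haskell
import Data.IORef (modifyIORef, newIORef, readIORef)

-- Wrap an interactive oracle so that a repeated question is answered
-- from memory instead of being put to the user again.
memoOracle :: (String -> IO Bool) -> IO (String -> IO Bool)
memoOracle ask = do
  cache <- newIORef []
  return $ \q -> do
    known <- readIORef cache
    case lookup q known of
      Just a  -> return a
      Nothing -> do
        a <- ask q
        modifyIORef cache ((q, a) :)
        return a

-- Ask three questions, two of them identical, and report the answers
-- together with the number of questions the user actually saw.
demo :: IO ([Bool], Int)
demo = do
  asked <- newIORef (0 :: Int)
  let user q = do
        modifyIORef asked (+ 1)
        return (q /= "last [?] => _|_")
  oracle <- memoOracle user
  answers <- mapM oracle ["4*2 => 8", "last [?] => _|_", "4*2 => 8"]
  n <- readIORef asked
  return (answers, n)
```

In demo the user is consulted only twice for three questions. Generalising an answer to a family of similar questions would need a coarser notion of equality than string comparison, which this sketch does not attempt.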



Runtime Overhead. With respect to the time overhead caused by the creation 
of traces the low-level implementation of Freja pays off. The overhead is not no- 
ticeable. In contrast, in Hat traced computations are more than ten times slower. 
For some inputs Adjoxo seems to be non-terminating but it is only slow! We 
experience the same with Hood when we observe at positions that are computed 
very often and that lead to large observations. So in Hood the time overhead is 
considerable but it is only proportional to the amount of observed data. 

Compiler Messages. A helpful error message from a compiler can reduce 
the need for a tracer. If a function is called with an argument for which no 
matching equation exists, then the aborting program gives the function name if 
it was compiled with the Glasgow Haskell compiler⁶ but not if it was compiled 
with Freja or nhc98. However, in that case Hat directly shows the function with 
its arguments whereas Freja requires the answers to numerous questions before 
locating the error. 

4.3 Redexes and Language Constructs 

A computation does not only consist of reductions of function applications. We 
noted already in Section 3.2 for let expressions that there are other kinds of 
redexes. This aspect only concerns Freja and Hat, because Hood only shows 
values. 



CAFs. A constant applicative form (CAF) is a top-level variable of arity zero, 
in other words a top-level function without arguments. Its value is computed on 
demand and shared by its users. Both Freja and Hat take the view that a CAF 
has no parent. Hence the trace of a program in Freja is generally not a single 
EDT but a set of EDTs, an EDT for each CAF including main. These EDTs are 
sorted so that a CAF only uses those CAFs about which questions have already 
been asked and which are hence known to be free of errors. Unfortunately one of 
our experiment programs containes 35 CAFs. We have to confirm the correctness 
of evaluation for all CAFs before reaching the question about main, although 

http : //www.haskell . org/ghc 
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none of these CAFs are related to any of the errors. Freja can be instructed 
to start with the question about main. However, that implies stating that the 
evaluation of all CAFs is correct, which may not be the case and thus lead Freja 
to give a wrong error location. An alternative definition of the EDT could imply 
that all users of a CAF are its parents. Then a question about a CAF would be 
asked only if it were relevant and memoisation of the question and its answer 
could avoid asking the same question when another reduction using the CAF 
were investigated. 
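The alternative EDT definition with memoised answers can be sketched as follows. This is only a toy model under stated assumptions: the EDT type, the Memo list and the functions ask and locate are hypothetical names, a pure function stands in for the user who answers questions, and nothing here reflects Freja's actual implementation. The CAF appears as a child of each of its users, and a question that was already answered is not asked again.

```haskell
-- A toy EDT: each node carries the question put to the user and its children.
data EDT = Node { question :: String, children :: [EDT] }

-- Answers given so far, keyed by the text of the question.
type Memo = [(String, Bool)]

-- Ask the oracle unless the question was answered before.
-- Returns the answer, the updated memo and the number of new questions (0 or 1).
ask :: (String -> Bool) -> Memo -> String -> (Bool, Memo, Int)
ask oracle memo q =
  case lookup q memo of
    Just a  -> (a, memo, 0)
    Nothing -> let a = oracle q in (a, (q, a) : memo, 1)

-- Find a faulty node: its own reduction is wrong although all children are right.
-- The third component counts the questions that were actually asked.
locate :: (String -> Bool) -> Memo -> EDT -> (Maybe String, Memo, Int)
locate oracle memo0 node
  | ok        = (Nothing, memo1, n1)
  | otherwise = go memo1 n1 (children node)
  where
    (ok, memo1, n1) = ask oracle memo0 (question node)
    go memo n []       = (Just (question node), memo, n)
    go memo n (c : cs) =
      case locate oracle memo c of
        (Just q,  memo', k) -> (Just q, memo', n + k)
        (Nothing, memo', k) -> go memo' (n + k) cs

-- Hypothetical trace: the CAF "table" is used by two reductions.
example :: EDT
example = Node "main = 7" [Node "g 2 = 4" [table, Node "k 2 = 5" [table]]]
  where table = Node "table = [1,2,3]" []

-- Hypothetical oracle: True means "this reduction is correct".
isCorrect :: String -> Bool
isCorrect q = q `notElem` ["main = 7", "g 2 = 4", "k 2 = 5"]
```

In this example locate isCorrect [] example visits the CAF node twice but asks about it only once, so four questions are asked instead of five.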

For Hat a corresponding modification without losing sharing of CAFs seems 
to be more difficult, because the redex trail is browsed by going backwards from 
an expression to its unique parent. In our experiments the fact that a CAF 
has no parent in a redex trail is not noticeable, because none of the introduced 
errors concerns CAFs. However, programs can be constructed where this lack
of information hinders locating an error: 

nats :: [Int]
nats = 0 : map succ nats

main = print (last nats)

The computation of this program does not terminate. When the programmer 
interrupts the computation, Hat may show map succ (0 : succ 0 : □) as
next redex to be evaluated. The parent of this redex is nats, which has no 
parent. The error may well be that the programmer intended to call another 
function than last in the definition of main, but unfortunately the redex last 
nats is unreachable. 

We stated in Section 3.2 that Hat has a special kind of redex for locally
defined variables of arity zero (defined in let expressions and where clauses).
The parent of such a variable redex is the redex that created the definition and
not - as for function application redexes - the redex that created the application.
So, as for CAFs, redexes may become unreachable.



Guards, cases and ifs. In Haskell the selection of an equation of a definition 
may not only be determined by pattern matching but may also depend on the 
value of a guard: 

test :: (a -> Bool) -> a -> Maybe a
test p x | p x       = Just x
         | otherwise = Nothing

In Freja the reduction of a guard (p x) is a child of the reduction of the function
(test). Redex trails are, however, traversed backwards from the result value
(Just x or Nothing). To hold the information about the reduction of a guard,
redex trails have an additional sort of redex. In the example, if the first equation
were chosen, then the value Just x would have the parent | True ◁ test p x,
and if the second equation were chosen, then the value Nothing would have
the parent | True ◁ | False ◁ test p x. By asking for the parents of the
truth values True and False in the redexes, the user can obtain information
about the evaluation of the guards.
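The chain of guard redexes described above can be modelled in a few lines. This is a hypothetical, much simplified data type for illustration and not Hat's trail format; the names Redex, resultParent and guardValues are assumptions.

```haskell
-- A much simplified redex trail node.
data Redex
  = App String [String]   -- function application redex, e.g. test p x
  | Guard Bool Redex      -- guard redex: the guard's value and its parent
  deriving (Eq, Show)

-- Parent of the result when equation number n (counting from 0) of a guarded
-- definition was selected: the n earlier guards were False, the chosen one True.
resultParent :: Int -> Redex -> Redex
resultParent n app = Guard True (iterate (Guard False) app !! n)

-- Traverse backwards to the function application, collecting the guard values
-- met on the way (the selected guard first).
guardValues :: Redex -> ([Bool], Redex)
guardValues (Guard b r) = let (bs, app) = guardValues r in (b : bs, app)
guardValues app         = ([], app)
```

For the second equation of test, resultParent 1 (App "test" ["p", "x"]) yields one True guard whose parent is a False guard whose parent is the application, and guardValues recovers both truth values together with the application.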

Similarly, Hat uses special redexes for case and if expressions. On the one 
hand, these special redexes complicate the system. On the other hand, they are 
useful for large function definitions. The special redexes enable more fine-grained
tracing up to the level of guards, cases and ifs, whereas Freja only identifies 
a whole function reduction as faulty. Similar to the situation for locally defined 
variables it is possible to extend the definition of Freja’s EDT by special nodes 
for guard, case and if reductions. For Hat, special redexes for these reductions 
are important to make parts of the redex trail reachable by backward traversal 
that otherwise would be unreachable. 

4.4 Modification of the Program 

Whereas Freja and Hat are applied to the original program, requiring only special
compilation, Hood is based on modifying the program. Sometimes the introduction
of the observe combinator requires non-trivial modifications, for example if
an operator is observed (because of its infix position) or if not a specific call but
all calls of a function are observed, as in our example in Section 2.3. Furthermore,
the main function has to be modified and the library has to be imported in every 
program module that uses its entities. Most importantly, a data type can only 
be observed if it is an instance of a class Observable. Some of our experiment 
programs define many data types; because we want to observe most of them, we 
have to write many instance definitions. Writing these instance definitions is easy 
but time consuming. Additionally, all these modifications potentially introduce 
new errors in the program and also make the program less readable. 

On the other hand it might be useful to leave the modifications for Hood in 
the program. They could be en-/disabled during compilation by a preprocessor 
flag for a debug mode. Then most modifications, especially writing instances of 
the class Observable, require only a one-time effort. The observe combinator 
may even be placed to observe the main data structures of the program. Thus 
debugging is integrated more closely into program development. In contrast, 
Freja and Hat cannot save any information from a tracing session for future 
versions of the program. 
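A minimal sketch of such a debug-mode switch, assuming the C preprocessor extension of GHC, a macro named DEBUG (a hypothetical choice), and the module name Debug.Hood.Observe under which Hood is distributed as the hood package today. In the normal build a stand-in with the same shape observes nothing; the function sumOfSquares is made up for illustration.

```haskell
{-# LANGUAGE CPP #-}

#ifdef DEBUG
-- Debug build: use Hood's combinator (module name as in the hood package).
import Debug.Hood.Observe (observe)
#else
-- Normal build: a stand-in with the same shape that observes nothing.
observe :: String -> a -> a
observe _label x = x
#endif

-- The observation point stays in the source in both builds.
sumOfSquares :: [Int] -> Int
sumOfSquares xs = sum (observe "squares" (map (^ 2) xs))
```

Compiling with -DDEBUG would select the real combinator; the modification of main that Hood requires would have to be guarded in the same way.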



5 Other Tracers and Debuggers 

Buddha is a tracing system which, like Freja, constructs an EDT. Its implementation
is based on a source-to-source transformation, but unlike the transformation
of Hat this transformation is not purely syntax-directed but requires
type information. Buddha is still actively developed.

Booth and Jones sketch a system which creates a trace quite similar to
an EDT. The main difference is that a parent node is only connected directly
to one child. All sibling nodes are connected with each other according to the
structure of the definition body of the parent node. Thus the trace has the nice
property that all connecting arrows denote equality, unlike the arrows in an EDT 
or a redex trail. The authors describe a browser which gives more freedom in 
traversing the trace than the questions of Freja. 

There also exist several systems for showing the actual computation sequence 
of a lazy functional program. Section 2.2 of [14], Chapter 11 of [5] and Chapter
2 and Section 7.5 of [8] review a large number of tracing and debugging systems 
for lazy functional languages. 

We could not include any of these systems in our experiments, because there 
are only limited prototypes, not publicly available. 

6 Summary and Conclusions 

We have compared and evaluated the tracing and debugging systems Freja, Hat 
and Hood by applying them to a number of programs. 

Tracing and debugging systems for lazy functional languages have made con- 
siderable progress in recent years: all three systems prove to be effective tools 
for debugging our programs. Though none of our programs is very large, some 
of them are large enough to show that the scope of application for the tools 
goes well beyond easy exercises. Unfortunately the practical usability of Hat 
and especially Freja is currently limited by the fact that they do not support 
full Haskell 98. 

Each of the tracing tools takes a unique approach with specific strengths. In 
particular, Freja has a systematic fault-finding procedure; Hat starts at the ob- 
served error and enables exploring backwards the history of every subexpression; 
Hood observes the data flow at specific program points by need. 

Based on our experiments we identify in Section 4 the strengths but also the
weaknesses of each system. For some weaknesses we already suggest improve- 
ments, often based on the convincing solutions of the problems in other systems. 
Other weaknesses are linked either to the tracing method or to the implementation,
both of which we discuss earlier in the paper. Hence they are more difficult to address
and require further research. For example, Freja cannot take advantage of the 
common case that only a subexpression of a reduction is wrong. Hat is slow 
and Hood gives almost no indication of how values are related. We claim that 
an integration of Freja into Hat is feasible whereas Hood’s approach is rather 
different from the approaches of the other two systems. 

Finally, good tools are not sufficient for debugging. The user needs advice on 
how to effectively use each system; a strategy needs to be developed for Hat and 
especially for Hood, but even Freja would benefit from advice on how to employ 
its advanced features. Also a strategy for using several systems together, taking 
advantage of their respective strengths, is desirable. 

Abstract. Pure, functional programming languages offer several solutions
to construct Graphical User Interfaces (GUIs). In this paper we
report on a project in which we port the Clean Object I/O library to
Haskell. The Clean Object I/O library uses an explicit environment passing
scheme, based on the uniqueness type system of Clean. It supports
many standard GUI features such as windows, dialogues, controls, and
menus. Applications can have timing behaviour. In addition, there is support
for interactive processes and message passing. The standard functional
programming language Haskell uses a monadic framework for I/O.
We discuss how the Object I/O library can be put in a monadic framework
without losing its essential features. We give an implementation
of an essential fragment of the Object I/O library to demonstrate the
feasibility. We pay special attention to the relevant design choices.
One particular design choice, how to handle state, results in two versions. 



1 Introduction 

The pure, lazy, functional programming language Clean [9,17] offers a sophisticated
library for programmers to construct Graphical User Interfaces (GUIs) at
a high level of abstraction, the Object I/O library. The uniqueness type system
of Clean is the fundamental tool to allow safe and efficient Input/Output.
This has been taken advantage of in the Object I/O library, which employs
an explicit multiple environment passing style (a less precise but more concise
term is “world as value”). From the outset, one of the key features of the
Clean I/O project has been the explicit handling of state, and the specification of
graphical user interfaces at a high level of abstraction. The approach has proven
to be successful and flexible, allowing the model to be extended with interactive
processes (on an interleaving and concurrent basis), message passing (synchronous
and asynchronous), and local state resulting in an object oriented style.
The library provides a rather complete set of GUI objects for real-world
applications and produces efficient code. This has been demonstrated by writing
a complete integrated development environment, the CleanIDE.

In Clean the uniqueness type system is used to support I/O in an explicit
multiple environment passing style. Two other styles of solutions have been pro- 
posed to handle I/O in a purely-functional setting: stream based and monad 

M. Mohnen and P. Koopman (Eds.): IFL 2000, LNCS 2011, pp. 194-|2]^ 2001. 
based [12,18]. The standard functional programming language Haskell [15,20]
initially adopted a stream based solution up to version 1.2. From version 1.3
on, monads were firmly integrated in the language. Many interesting experimental
frameworks have been proposed to handle GUI programming in both styles
([10,16,23,25] to name a few). For a broad overview see Section 7.

In this paper we report on a project in which we ported a core subset of this 
I/O system to Haskell. There are several motives to embark on such a project. 

— Monads are considered to be a standard way of handling I/O in pure func- 
tional languages. In this project we demonstrate that it is possible to transfer 
the concepts of the Object I/O system to a monadic framework. 

— Designing a solution to functional GUI programming is one thing, but it is a 
truly large effort to maintain, extend and improve such a system. The Clean
Object I/O library has proven itself in practice. It is efficient and in a fairly 
stable state. Porting this library to Haskell is a relatively small effort. 

— When comparing programming languages and the applications written in 
them, it is crucial to share identical libraries. Especially for the important 
application domain of interactive applications, the lack of these libraries 
makes it hard to do serious comparative studies. 

— The development of the Object I/O system and the Clean language have
mutually influenced each other beneficially. One can expect similar effects 
between library and language when porting the system to Haskell. 

— The Haskell compiler that we use in this project is the Glasgow Haskell 
Compiler 4.08.1. It extends Haskell 98 with several features that are required
by the Object I/O system (existential types and a foreign function interface).
In addition to these features it supports a variety of useful extensions such as
rank-2 polymorphism, thread creation, and communication/synchronisation
primitives. In this project we show how we have used these extensions to 
simplify the implementation of interactive processes and message passing. 

One might wonder if this project is bound to fail in advance, because Clean
and Haskell use different basic techniques to bring pure functional programming 
and I/O into close harmony. The answer is no because even though the Object 
I/O system uses the world as value paradigm, it does not essentially rely on it. 
The key idea of the system is that GUI objects are described by algebraic data 
types. The behaviour of a GUI object is defined by a set of callbacks. A callback is 
essentially a piece of code that must be executed in well-defined circumstances 
(usually called events). In a world as value paradigm one can simply model 
these callbacks as functions of type (state, *World) -> (state, *World). In a
monadic framework these callbacks can be modeled as monadic actions of type
state -> IO state, or even just IO ().
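The two callback shapes can be put side by side in Haskell. This is only an illustration under assumptions: Haskell has no uniqueness types, so the World type below is an ordinary value standing in for Clean's *World, and the names CallbackW, CallbackM, incrW and incrM are hypothetical.

```haskell
-- Stand-in for Clean's unique *World: here simply a log of effects.
-- In Clean the type system guarantees that it is used single-threadedly.
newtype World = World [String]

-- World-as-value style: the environment is passed in and handed back.
type CallbackW state = (state, World) -> (state, World)

-- Monadic style: the environment is hidden inside IO.
type CallbackM state = state -> IO state

-- The same callback written in both styles.
incrW :: CallbackW Int
incrW (n, World logs) = (n + 1, World (("now " ++ show (n + 1)) : logs))

incrM :: CallbackM Int
incrM n = do
  putStrLn ("now " ++ show (n + 1))
  return (n + 1)
```

Both functions compute the same new state; they differ only in whether the effect on the environment is visible in the type as an explicit argument and result.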

A closely related question is whether it is possible to handle local state in 
a monadic framework in a way that reflects the philosophy of the Object I/O 
library. We show that it is possible to provide a translation to Haskell that 
(except for the obvious difference in callbacks) is exactly identical to the Clean
version. However, we also explored an alternative design, in which state is held 
in mutable variables, an approach that turns out to give a considerably simpler
type structure. Because the local state version of the Object I/O library has
been discussed at length elsewhere [4,6], we will discuss the alternative mutable
variable based design in full detail in this paper, and compare it with the local
state version in Section 3.

The Clean Object I/O library is big. Version 1.2.1 consists of 145 modules
that provide an application programmer’s interface (API) of 43 modules giving
access to roughly 500 functions and 125 data types. For a feasibility study this is
obviously a bit too much to port, so we have restricted ourselves to a fragment of
the API that contains the essential features. This subset, the mini Haskell Object
I/O library, is sufficiently expressive to create as a test case target a concurrent
talk application (see Figure 1(a)). In the mini Haskell Object I/O library you 
can open and close arbitrarily many interactive processes (two in the test case). 
Each interactive process can open and close arbitrarily many dialogues (one in 
each interactive process). Each dialogue can contain arbitrarily many text-, edit-, 
and button controls (in the test case the dialogues contain two edit controls, one 
for input, one for output). In addition we have ported asynchronous message 
passing (text typed in the upper edit control is sent to the receiver of the other 
interactive process which displays the text in the lower edit control). 
Fig. 1. (a) Concurrent talk (b) Layered architecture



Another means of reducing the porting effort to Haskell is by making use 
of the layered architecture of the Clean Object I/O library (see Figure 1(b)). 
The Object I/O library basically consists of two layers: at the bottom we have a 
layer that implements the actual interface with the underlying operating system. 
This is the OS dependent layer. It defines an interface that is used by the top
layer, which is therefore OS independent. The OS independent layer is written 
entirely in Clean. The OS dependent layer has been designed in such a way that 
it is relatively easy to implement on most kinds of GUI toolkits. For this we 



Porting the Clean Object I/O Library to Haskell 197 



have drawn on our experience of porting earlier versions of Clean I/O libraries 
to platforms such as Microsoft Windows, Macintosh, and X Windows.

The remainder of this paper is structured as follows. We start with a detailed 
discussion of the mutable variable based version of the mini Haskell Object I/O 
library API in Section 2. We then compare this new approach with a one-to-one
translation of the Clean Object I/O library to Haskell in Section 3. The implementation
of the OS independent layer (Section 4) and the OS dependent layer
(Section 5) are basically the same for both versions. Porting the Clean Object
I/O library to Haskell is a good opportunity to compare the two languages, libraries,
and tools. This is done in Section 6. We present related work in Section 7
and conclude in Section 8.

2 The Mini Haskell Object I/O API 

As our first step, we present the design of the mini Haskell Object I/O system, 
as seen by the programmer. The version presented in this section handles state 
by means of mutable variables. The design rationale is basically the same as that of the
local state version of the Object I/O library, so we will not discuss it. Instead
we content ourselves with a brief overview based on examples. 

As has been argued briefly in the introduction, the only true language inde- 
pendent difference between the two libraries is the way callbacks are represented. 
We give the monadic approach in Section 2.1. Then we illustrate the way local
state is handled in Section 2.2. The remaining essential GUI components that
are required for the concurrent talk test case are handled in Section 2.3.

2.1 A Monad for State Transitions 

The principal concept to grasp about the Object I/O library is that it is a state 
transition system. The behaviour of every GUI object that can be defined in 
the library is a callback that, when it needs to be evaluated, is applied to the
‘current’ process state and returns a new process state. The new process state 
is the next ‘current’ process state. The programmer only needs to define initial 
state values and the GUI objects that contain the behaviour functions. The 
Object I/O system takes care of all GUI event handling and ensures that the 
proper functions are applied to the proper state. 

In the Clean Object I/O library, the process state is handed to the programmer
explicitly as an environment value of abstract type IOSt (called the I/O state).
This environment is managed entirely by the Object I/O system. Every callback
is forced by the uniqueness type system of Clean to return a unique I/O state.
As the I/O state is an abstract value, and there are no denotations available to
the programmer, we can ensure that all GUI operations can be performed safely.

In Haskell, instead of passing the I/O state around explicitly, we encapsulate 
it in a monad, in the standard way: 
data GUI a = GUI (IOSt -> IO (a, IOSt))

instance Monad GUI where
  (>>=)  = bindGUI
  return = returnGUI

bindGUI :: GUI a -> (a -> GUI b) -> GUI b
bindGUI (GUI fA) to_ioB
  = GUI (\ioSt -> do { (a, ioSt1) <- fA ioSt
                     ; case to_ioB a of
                         GUI fB -> fB ioSt1 })

returnGUI :: a -> GUI a
returnGUI a = GUI (\ioSt -> return (a, ioSt))

Defining the GUI monad to be an enhanced IO monad allows us to combine
existing Haskell I/O code with the Object I/O code. For this purpose one can
lift any IO action to a GUI action:

liftIO :: IO a -> GUI a
liftIO m = GUI (\ioSt -> m >>= \a -> return (a, ioSt))
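For readers who want to try the monad with a current GHC, the following self-contained sketch adds the Functor and Applicative instances that newer compilers require. It rests on assumptions: IOSt is replaced by a placeholder that merely counts actions, and runGUI, tick and demo are hypothetical helpers that are not part of the library.

```haskell
-- Placeholder for the abstract I/O state: it only counts executed actions.
newtype IOSt = IOSt Int

newtype GUI a = GUI (IOSt -> IO (a, IOSt))

-- Unwrap a GUI action so that it can be applied to a state.
runGUI :: GUI a -> IOSt -> IO (a, IOSt)
runGUI (GUI f) = f

instance Functor GUI where
  fmap f (GUI g) = GUI (\s -> do { (a, s1) <- g s; return (f a, s1) })

instance Applicative GUI where
  pure a = GUI (\s -> return (a, s))
  GUI ff <*> GUI fa =
    GUI (\s -> do { (f, s1) <- ff s; (a, s2) <- fa s1; return (f a, s2) })

instance Monad GUI where
  -- The state returned by the first action is threaded into the second.
  GUI fA >>= to_ioB = GUI (\s -> do { (a, s1) <- fA s; runGUI (to_ioB a) s1 })

-- Lift an ordinary IO action; the state passes through unchanged.
liftIO :: IO a -> GUI a
liftIO m = GUI (\s -> m >>= \a -> return (a, s))

-- Hypothetical primitive that modifies the state.
tick :: GUI ()
tick = GUI (\(IOSt n) -> return ((), IOSt (n + 1)))

-- Run two ticks around a lifted IO action and return the final count.
demo :: IO Int
demo = do
  (_, IOSt n) <- runGUI (tick >> liftIO (putStrLn "hello") >> tick) (IOSt 0)
  return n
```

Evaluating demo prints hello and returns 2, which shows that the state is threaded through both ticks while the lifted IO action leaves it untouched.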



2.2 A Simple Example 



Let us write a GUI application that displays an up-down counter: a displayed 
number, together with a button to increment it and another to decrement it: 






One writes a program with a graphical user interface by defining a value of type 
GUI () and then “running” it by applying startGUI:

main :: IO ()
main = startGUI upDownGUI

upDownGUI :: GUI ()
upDownGUI = do { counter <- newCounter
               ; openDialog (Dialog "Counter" counter []) }

newCounter :: GUI (TupLS TextControl (TupLS ButtonControl ButtonControl))
newCounter = ...to be defined shortly...

Here, newCounter creates one instance of our up-down counter, while open- 
Dialog opens a window in which the up-down counter is wrapped: 

startGUI :: GUI () -> IO ()

class Dialogs d where
  openDialog :: d -> GUI ()

The function openDialog opens a dialogue window (Dialog ...) whose contents
can include all manner of things, which is why it is overloaded. Indeed, as
you can see, the type of newCounter expresses the fact that it returns a component
composed of three sub-components.

The next thing we must do is to define newCounter. A new feature of the
mini Haskell Object I/O library, when compared with the Clean Object I/O
library, is the way local state is handled. We have chosen to use mutable variables
to handle local and public state. In this approach local state can still be
encapsulated in the object, and hidden from the context in which it is used, thus 
supporting reusable GUI objects. Here, then, is how we define newCounter: 

newCounter :: GUI (TupLS TextControl (TupLS ButtonControl ButtonControl))
newCounter = do { c_state <- newMVar 0
                ; disp_id <- openId
                ; let display :: TextControl
                      display = TextControl "0" [ControlId disp_id]

                      dec, inc :: ButtonControl
                      dec = ButtonControl "-" [ControlFunction down]
                      inc = ButtonControl "+" [ControlFunction up]

                      up, down :: GUI ()
                      up   = update disp_id c_state (+ 1)
                      down = update disp_id c_state (subtract 1)
                ; return (display :+: dec :+: inc) }

update :: Id -> MVar Int -> (Int -> Int) -> GUI ()
-- Update the MVar, and display new value in control identified by Id
update d m f = do { v <- takeMVar m
                  ; let new_v = f v
                  ; putMVar m new_v
                  ; setControlText d (show new_v) }

newCounter uses the GUI monad to create (a) a mutable cell, c_state, that 
will contain the state of the counter, and (b) a unique identifier, disp_id, used to 
name the display. Then it constructs the three sub-components, display, dec, 
and inc, composes them together using (:+:), and returns the result.
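The state handling of the counter can be tried without any GUI. The sketch below uses the MVar operations of Control.Concurrent.MVar from GHC's base library; everything else is an assumption made for illustration: an IORef holding a String stands in for the text control, callbacks are plain IO actions instead of GUI actions, and demo simulates three button clicks.

```haskell
import Control.Concurrent.MVar (MVar, newMVar, takeMVar, putMVar)
import Data.IORef (IORef, newIORef, readIORef, writeIORef)

-- Hypothetical stand-in for a text control: its text lives in an IORef.
type Display = IORef String

-- Same shape as update in the text, with IO in place of GUI.
update :: Display -> MVar Int -> (Int -> Int) -> IO ()
update d m f = do
  v <- takeMVar m            -- empties the cell until putMVar refills it
  let new_v = f v
  putMVar m new_v
  writeIORef d (show new_v)  -- plays the role of setControlText

-- Create the state and the display, and return the display with the two callbacks.
newCounter :: IO (Display, IO (), IO ())
newCounter = do
  c_state <- newMVar 0
  display <- newIORef "0"
  let up   = update display c_state (+ 1)
      down = update display c_state (subtract 1)
  return (display, down, up)

-- Simulate the clicks "+", "+", "-" and read the displayed text.
demo :: IO String
demo = do
  (display, down, up) <- newCounter
  up >> up >> down
  readIORef display
```

The mutable cell is visible only to the two callbacks that close over it, which is the encapsulation of local state described above.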

To achieve all this, we used the following library functions and data types: 

newMVar  :: a -> GUI (MVar a)
takeMVar :: MVar a -> GUI a
putMVar  :: MVar a -> a -> GUI ()

openId :: GUI Id

setControlText :: Id -> String -> GUI ()

infixr 9 :+:
data TupLS a b = a :+: b

data ButtonControl = ButtonControl String [ControlAttribute]
data TextControl   = TextControl   String [ControlAttribute]

data ControlAttribute = ControlId Id
                      | ControlFunction (GUI ())
                      | ControlKeyboard (...) (...)
                                        (KeyboardState -> GUI ())
                      | ...

The MVar family allows you to create and modify a mutable cell; these operations
are described in detail elsewhere.

For every GUI object an algebraic data type is provided that describes what 
that object looks like and how it behaves. Every type has a small number of 
mandatory arguments and a list of optional attributes — see the definitions for 
ButtonControl and TextControl given above. The TupLS type allows you to 
compose two controls to make a larger one. Notice, though, that the entire GUI 
component is simply a data value describing the construction of the component. 

The component can be given a behaviour by embedding callbacks in the 
attributes of the component. In particular, the ControlFunction attribute of 
the inc and dec controls is a callback that updates the counter. This callback
is run whenever the button is clicked; it simply calls update. The latter updates
the state of the counter, and uses setControlText to update the display. 
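That a component is only a data value with callbacks embedded in its attributes can be illustrated with a cut-down model. The types below are simplified stand-ins (Int for Id, IO () for GUI ()), and click is a hypothetical function showing what an event loop might do; it is not a library function.

```haskell
import Data.IORef (newIORef, modifyIORef, readIORef)

data ControlAttribute
  = ControlId Int              -- the library uses an abstract Id
  | ControlFunction (IO ())    -- the library uses GUI ()

data ButtonControl = ButtonControl String [ControlAttribute]

-- What an event loop might do on a click:
-- run every ControlFunction callback found among the attributes.
click :: ButtonControl -> IO ()
click (ButtonControl _ attrs) = sequence_ [cb | ControlFunction cb <- attrs]

-- Build a button as a plain value, then "click" it twice.
demo :: IO Int
demo = do
  counter <- newIORef (0 :: Int)
  let inc = ButtonControl "+" [ControlId 1, ControlFunction (modifyIORef counter (+ 1))]
  click inc
  click inc
  readIORef counter
```

Constructing inc has no effect by itself; only click runs the callback, so demo returns 2.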

In order to change GUI components we need to identify them. That is what 
disp_id :: Id is doing. It is used by the callbacks up and down to identify the
GUI component (display) they want to side-effect. Indeed, MVars and Ids play a
very similar role: an MVar identifies a mutable location, while an Id identifies a
mutable GUI component. Fresh, unique Id values are created by openId.

It is very useful to be able to create Id and MVar values at any place in the 
program (see Section 2.3). For this reason, it is convenient to overload these
functions so they can be used in either the IO or GUI monad:

class Ids m where
  newMVar  :: a -> m (MVar a)
  takeMVar :: MVar a -> m a
  putMVar  :: MVar a -> a -> m ()
  openId   :: m Id
  ... Ids also has other methods ...

instance Ids IO
instance Ids GUI
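A minimal sketch of this overloading, restricted to openId and built on assumptions: Id wraps Data.Unique from GHC's base library, the GUI monad is cut down to a wrapper around IO, the class has a Monad superclass for convenience, and twoIds is a hypothetical client that runs in either monad.

```haskell
import Data.Unique (Unique, newUnique)

newtype Id = Id Unique deriving Eq

-- Cut-down GUI monad over IO; the real one also threads the I/O state.
newtype GUI a = GUI { runGUI :: IO a }

instance Functor GUI where
  fmap f (GUI m) = GUI (fmap f m)

instance Applicative GUI where
  pure = GUI . pure
  GUI f <*> GUI a = GUI (f <*> a)

instance Monad GUI where
  GUI m >>= k = GUI (m >>= runGUI . k)

-- One class, two instances: the same name works in IO and in GUI.
class Monad m => Ids m where
  openId :: m Id

instance Ids IO where
  openId = fmap Id newUnique

instance Ids GUI where
  openId = GUI openId   -- reuses the IO instance

-- Written once, usable in both monads.
twoIds :: Ids m => m (Id, Id)
twoIds = do { a <- openId; b <- openId; return (a, b) }
```

The same twoIds can be run directly in IO or, via runGUI, inside the GUI wrapper, and yields two distinct identifiers in both cases.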

2.3 Concurrent Talk 

As a second example we take the concurrent “talk” program, depicted in Fig- 
ure 1(a). Text typed into the upper panel of either window should be echoed in 
the lower panel of the other window. 



Receivers. This application involves two concurrent “processes”, and we re- 
quire a channel of communication going in each direction. The following func- 
tions manipulate channels: 
class Ids m where ...
  openRId :: m (RId a)

asyncSend :: RId msg -> msg -> GUI SendReport

class Receivers rdef where
  openReceiver :: rdef -> GUI ()
instance Receivers (Receiver msg)

data Receiver msg
  = Receiver (RId msg) (msg -> GUI ()) [ReceiverAttribute]

A new channel is created by openRId, which is overloaded like openId, and
returns a typed receiver name of type RId. You can send a message to a receiver
using asyncSend. That triggers a callback in a (non-displayed) component of 
type Receiver. The latter contains its identifier together with the callback to 
be run when the message is received. 
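The asynchronous nature of asyncSend can be modelled with a mailbox. The sketch is a single-threaded toy under assumptions: an RId is represented by an IORef holding the pending messages, callbacks run in IO, asyncSend returns no send report, and dispatch is a hypothetical stand-in for the event loop of the library.

```haskell
import Data.IORef (IORef, newIORef, readIORef, writeIORef, modifyIORef)

-- A typed receiver id, modelled as a mailbox of pending messages.
newtype RId msg = RId (IORef [msg])

-- A receiver: its identifier together with the callback to run per message.
data Receiver msg = Receiver (RId msg) (msg -> IO ())

openRId :: IO (RId msg)
openRId = fmap RId (newIORef [])

-- Asynchronous send: enqueue the message and return at once.
asyncSend :: RId msg -> msg -> IO ()
asyncSend (RId box) msg = modifyIORef box (++ [msg])

-- What an event loop might do: hand each pending message to the callback.
dispatch :: Receiver msg -> IO ()
dispatch (Receiver (RId box) callback) = do
  pending <- readIORef box
  writeIORef box []
  mapM_ callback pending

-- Send two messages, then dispatch; the callback records what it sees.
demo :: IO [String]
demo = do
  rid  <- openRId
  seen <- newIORef []
  let receiver = Receiver rid (\s -> modifyIORef seen (++ [s]))
  asyncSend rid "hello"
  asyncSend rid "world"
  before <- readIORef seen   -- nothing has been dispatched yet
  dispatch receiver
  after <- readIORef seen
  return (before ++ after)
```

The callback does not run during asyncSend; it runs only when the receiver is dispatched, and the messages arrive in the order in which they were sent.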

Interactive Processes. So the main program looks like this: 

main :: IO ()
main = do { a <- openRId
          ; b <- openRId
          ; let talkA = talk "A" (a, b)
                talkB = talk "B" (b, a)
          ; startProcesses [talkA, talkB] }

talk :: String -> (RId TalkMsg, RId TalkMsg) -> Process
talk str (me, you) = Process (...) (talkGuts str (me, you))
                             [ProcessClose (quit you)]

talkGuts :: String -> (RId TalkMsg, RId TalkMsg) -> GUI ()
talkGuts = ...to be defined...

quit :: RId TalkMsg -> GUI ()
quit = ...to be defined...

The overloaded function startProcesses takes a list of interactive process
definitions and evaluates them until all child processes have terminated. The two talk
processes are identical except for their (string) name and receiver identification. 
This is expressed conveniently by parameterisation of the talk function. 

We consider an interactive process to be a collection of GUI objects that share 
some common user interface. A process performs no independent computational 
activity other than the callback mechanism. Interactive processes are specified in 
the same way as all other GUI objects by means of an algebraic type constructor, 
which is defined in the library as follows: 

data Process = Process (...) (GUI ()) [ProcessAttribute]
startProcesses :: [Process] -> IO ()
closeProcess :: GUI ()

or, in the real library, a (nested) list of them 



Process has two mandatory arguments (we ignore the first) and a list of
optional attributes. The (GUI ()) argument is the initialisation action of an
interactive process: it is the first action of the interactive process, and is run when
the process is started by startProcesses. We will consider only one process 
attribute, ProcessClose, which is analogous to the WindowClose attribute dis- 
cussed above: the callback associated with this attribute is evaluated whenever 
the user dismisses the interactive process. 

In this example, the initialisation action is defined by talkGuts: 

talkGuts :: String -> (RId TalkMsg, RId TalkMsg) -> GUI ()
talkGuts str (me, you)
  = do { outId <- openId
       ; inId  <- openId
       ; let talkdialog :: Dialog (TupLS EditControl EditControl)
             talkdialog = mkTalkDialog str you inId outId

             receiver :: Receiver TalkMsg
             receiver = Receiver me (receive outId) []
       ; openDialog talkdialog
       ; openReceiver receiver }

mkTalkDialog :: String -> RId TalkMsg -> Id -> Id
             -> Dialog (TupLS EditControl EditControl)
mkTalkDialog name you inId outId
  = Dialog ("Talk " ++ name) (infield :+: outfield) [WindowClose (quit you)]
  where
    infield  = EditControl "" (ContentWidth "mmmmmmmmmm") 5
                 [ ControlId inId, ControlKeyboard (...) (...) input ]
    outfield = EditControl "" (ContentWidth "mmmmmmmmmm") 5
                 [ ControlId outId, ControlPos (Below inId, zero) ]

    input :: KeyboardState -> GUI ()
    input = ...to be defined...

This code creates two new Ids to identify the two panels of the window, con- 
structs the dialogue and receiver, and then opens them. The receiver is straight- 
forward — we defined Receiver in the previous section — and is passed the 
callback (receive outId). The dialogue is built in much the same way as we
built the counter earlier, except that it uses editable-text panels (EditControl)
instead of buttons. The ControlKeyboard attribute of the infield takes a call- 
back, input, which tells the control how to respond to user input. 

Message Passing. The remaining pieces of the puzzle are those that send 
messages. First we need to define the type of messages that flow between the 
two processes. As specified informally in the introduction, keyboard input in 
the input field of the talk dialogue of one interactive process should be sent to 
the other interactive process (and displayed in the output field). In addition, 
if the user dismisses either dialogue, the other one should also be notified as 
terminated. This is arranged by the following simple message type: 
data TalkMsg = NewLine String | Quit

The receiver callback action straightforwardly implements the informal specification
above: the response to a (NewLine text) message should be to change
the content of the output field to text (using the library function setControlText),
and the response to a Quit message should be to terminate its parent
process (using the library function closeProcess):

receive :: Id -> TalkMsg -> GUI ()
receive outId (NewLine text) = setControlText outId text
receive _     Quit           = closeProcess

The behaviour of the input callback is to read the current content of the 
input control and send it to the other interactive process. (The library function 
getParentWindow returns an abstract value that represents the current state of
the complete dialogue. The function getControlText retrieves the content of 
any text related control. Their types are included below.) 

input :: KeyboardState -> GUI ()
input _
  = do { Just window <- getParentWindow inId
       ; let text = fromJust (snd (getControlText inId window))
       ; _ <- asyncSend you (NewLine text)
       ; return () }

-- Library types:
getParentWindow :: Id -> GUI (Maybe WState)
getControlText  :: Id -> WState -> (Bool, Maybe String)

Finally, the quit callback closes its own process, and sends a Quit message 
to the other process: 

quit :: Rid TalkMsg -> GUI ()
quit you = do { asyncSend you Quit; closeProcess }

It should be observed that in the Object I/O library one interactive process
cannot terminate another interactive process. The closeProcess function takes
no process identification; it always terminates the interactive process of the
GUI component whose callback evaluates it. The same holds for all other
actions: one interactive process cannot directly create or close a window in
another interactive process. Interactive processes interact only by message
passing or via the external world.



3 The Pros and Cons of MVars 

The mini Haskell Object I/O API discussed so far relies on mutable variables 
to keep track of the state of an interactive program. This is different from the 
way state is handled in the Clean Object I/O library. We will not discuss this 
system in detail, because this has been done extensively elsewhere. Briefly,
the Clean Object I/O system keeps track of all state. Type security is obtained
by parameterisation of all GUI type constructors with the types of the local 
and public state. Specialised type constructor combinators are required to ob- 
tain the proper state encapsulation. Both approaches to handle state have been 
implemented in the mini Haskell Object I/O port. In this section we analyse the 
features of the two approaches. 

Handling state with mutable variables has a number of advantages when 
compared with the Clean scheme. 

Firstly, the set of type constructors is simpler (no state type variables) and 
smaller (local state type constructor combinators are superfluous). In our expe- 
rience these elements of the Object I/O library cause a steep learning curve to 
novice GUI programmers. Despite the reduction of complexity, GUI definitions 
are identical to the local state version (which is identical to the Clean version). 
This makes it easy to convert code between these versions.

The second advantage is increased flexibility. GUI components can share 
state in a more complex way than is possible in the local state version. It is 
unclear if it is possible to extend the local state version with more powerful 
state combinators, but even so this will increase the complexity of the system. 

Thirdly, even though mutable variables are globally accessible, the fact that
one needs a reference to a variable in order to use it gives the programmer
fine-grained control over the actual access to the data. This is in fact
analogous to the current situation with identification values: we can use the
same lexical scoping techniques to control access to GUI objects as well as
state objects (and get similar ‘preambles’ as discussed at the end of
Section 2.3).

The major disadvantage of handling state with mutable variables is that it 
is less declarative. The burden of state management is shifted from the library 
implementer to the application programmer. To illustrate this case, here is the 
local state version of the up-down counter (Section 2.2):

newCounter
  = do { disp_id <- openId
       ; let ... control definitions are identical ...
             up   = update disp_id (+ 1)
             down = update disp_id (subtract 1)
       ; return (NewLS 0 (display :+: dec :+: inc)) }

update :: Id -> (Int -> Int) -> (Int, ps) -> GUI ps (Int, ps)
update d f (v, state) = do { let new_v = f v
                           ; setControlText d (show new_v)
                           ; return (new_v, state) }

Instead of retrieving and storing the current count explicitly from a mutable 
variable, update has direct access to the local state, and is required by the type 
system to return a new value. The relation between the initial local state value 
0 and the counter is determined by the NewLS type constructor combinator. 

We need more experience to decide which of the approaches to handle state 
is the best choice. 
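
The trade-off between the two styles can be sketched outside the GUI library. The following is a minimal illustration only: plain IO stands in for the GUI monad, the functions return the display text instead of calling setControlText, and the names updateMVar and updateLS are hypothetical and not part of the Object I/O API.

```haskell
import Control.Concurrent.MVar (MVar, newMVar, modifyMVar)

-- Mutable-variable style: the callback does its own bookkeeping.
-- It reads the count, updates it, and writes it back to the variable.
updateMVar :: MVar Int -> (Int -> Int) -> IO String
updateMVar var f = modifyMVar var $ \v ->
  let new_v = f v in return (new_v, show new_v)

-- Local-state style: the run-time system passes the state in, and the
-- type forces the callback to hand a new state back.
updateLS :: (Int -> Int) -> (Int, ps) -> IO (String, (Int, ps))
updateLS f (v, ps) =
  let new_v = f v in return (show new_v, (new_v, ps))

main :: IO ()
main = do
  var <- newMVar 0
  t1  <- updateMVar var (+ 1)
  t2  <- updateMVar var (subtract 1)
  putStrLn (t1 ++ " " ++ t2)
  (t3, st) <- updateLS (+ 1) (0 :: Int, ())
  (t4, _)  <- updateLS (+ 1) st
  putStrLn (t3 ++ " " ++ t4)
```

In the first style nothing stops a callback from forgetting to write the new count back; in the second the state threading is visible in the type, at the price of a more involved signature.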

4 The OS Independent Layer

The only crucial difference between the Clean and Haskell versions of the Object
I/O library is the way callbacks are handled (functions versus actions). The Clean
Object I/O library is a sequential implementation that encodes the interactive
process scheduling mechanism and message passing. This implies that it is in
principle sufficient to reimplement only the callback evaluation mechanism, and
simply translate the other parts from Clean to Haskell.

In the introduction we have stated that we were going to use GHC (for issues
related to other Haskell compilers we refer to Section 8). The major motivation
is that the Object I/O library (in fact, some of its predecessor versions [3]) has
been designed with concurrency in mind: it should, in principle, be possible to
implement interactive processes as concurrent evaluation processes, and even to
create distributed interactive applications. These are things that are well
supported by GHC, and the combination with other required features fixed the
choice for this particular compiler technology.

We are not going to discuss every detail of the implementation of the OS
independent layer. Instead we focus on the following aspects of its
implementation: in Section 4.1 we discuss how monadic callbacks with local state
can be evaluated, in Section 4.2 we show how interactive processes are mapped to
concurrent threads, and in Section 4.3 how message passing is handled.

4.1 Evaluation of Monads with Local State 

Computing state transitions is straightforward in the mini Haskell Object I/O 
system based on mutable variables: whenever a callback action must be evalu- 
ated, the run-time system only needs to locate the proper action and apply it. 
All state handling is done by the callback action. In the local state version things 
are more complicated: the run-time system not only needs to locate the proper 
action, but also construct the proper state argument to apply the action to. The
new state thus computed must then be stored back in the administration.

The Clean Object I/O implementation uses an elegant solution to compute
and store local state. It relies on lazy evaluation to create references to local
state values that will eventually be computed by callback functions (ensured by
the type system). These references are stored in the internal administration (the
IOSt environment) which is passed as the argument to the callback function.
Because the involved environments are explicitly available in the Clean Object
I/O library, it is rather intuitive to ‘connect’ these forward references. In the
local state version of the mini Haskell Object I/O library we have been able to
copy this strategy, using the monadic extension fixIO. Due to lack of space we
omit a detailed presentation. The key idea is that fixIO also allows us to
manipulate results that are not yet computed, but lazily available.
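
The forward-reference idea can be illustrated with the standard fixIO from System.IO. This is a small sketch, not the library's actual code: the callback function below is a hypothetical stand-in that receives a reference to a state value which it only produces itself at the end of the action.

```haskell
import System.IO (fixIO)

-- A stand-in for a callback: it is handed the (lazily available) final
-- local state and returns a label text together with that new state.
callback :: Int -> IO (String, Int)
callback finalState = do
  -- Forward reference: the label mentions the final state, but the
  -- state is not forced while the action runs.
  let label = "count = " ++ show finalState
  return (label, 41 + 1)

-- fixIO feeds the result of the action back in as its own argument.
-- The lazy pattern (~) keeps the tuple from being inspected too early.
demo :: IO (String, Int)
demo = fixIO (\ ~(_, st) -> callback st)

main :: IO ()
main = demo >>= print
```

The sketch works only because callback never inspects finalState before returning; forcing it inside the action would make the computation depend on its own unfinished result.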

4.2 Interactive Processes 

The concurrent talk test case spawns two interactive processes. Interactive
processes have been designed with concurrent evaluation in mind. They should
behave as if they were independent applications running in a pre-emptive OS.
We have implemented interactive processes using the Concurrent Haskell
primitives [19] forkIO (for thread creation), MVars (for sharing context
information), and Channels (for abstract event dispatching). Because the
Microsoft Windows OS expects a single event loop driven application, we can’t
implement each of these processes as independent loops fetching and dispatching
OS events. The architecture of the concurrent implementation is sketched in
Figure 2.




Fig. 2. Concurrent implementation of interactive processes 



The co-thread architecture is a legacy from the Clean implementation. Clean
cannot be called from C, yet such calls are necessary on the Microsoft Windows
platform, because some OS calls require further callbacks to be evaluated before
the call is finished (for instance, when creating a window several dozens of
callbacks are triggered). Instead of calling Clean from C directly, this
communication is done indirectly via two OS threads that run as co-routines.
Information is passed via a small globally accessible buffer. It should be noted
that one can call Haskell from C, so we should be able to eliminate these
co-threads. Because of the preliminary nature of this project, we have changed
neither the code of the OS co-thread nor the architecture.

The Haskell co-thread is basically a reordering of existing pieces of functional
code in the scheduling module of the Clean implementation. Because we use
the forkIO primitive, the scheduling code disappears. Each interactive process
(I/O process) runs in a Haskell thread. They are driven by an event loop that
handles only abstract events. The closeProcess function terminates the loop,
and the I/O process gets garbage collected. Abstract events are generated by an
additional Haskell thread, the abstract event dispatcher, that maps OS events
to abstract events and dispatches them to the proper I/O processes. Recall that
interactive programs are created with the startProcesses library function. This
function creates the initial Haskell threads.

The only information required by the abstract event dispatcher is which I/O
processes are currently present, so that it can dispatch the proper abstract
events. This is kept in the globally accessible Context, stored conveniently in
an MVar. Every I/O process maintains an OS event filter function in this
Context. After every callback evaluation, it updates this entry, and in the act
of termination it removes it from the administration. The abstract event
dispatcher terminates when this list is empty, resulting in the required
behaviour of startProcesses.

The major advantage of this architecture is that it is scalable: it is easy to
create and destroy interactive processes. It is also very suitable for a
distributed environment: if Object I/O applications are ever distributed over a
network, almost all we need to do is create a new remote initial context,
an event dispatcher, and an initial I/O process thread.
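
The thread structure described in this section can be sketched with the standard Concurrent Haskell primitives. This is a simplified illustration under several assumptions: the Event type, the association-list Context, and all function names are hypothetical; the real Context holds OS event filter functions rather than channels, and here the main thread plays the role of the abstract event dispatcher.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Concurrent.MVar
  (MVar, newMVar, newEmptyMVar, modifyMVar_, readMVar, putMVar, takeMVar)
import Data.List (sort)

data Event = Key Char | Close        -- hypothetical abstract events

-- The shared administration: which I/O processes are currently present.
type Context = MVar [(Int, Chan Event)]

-- One I/O process: an event loop that handles abstract events only.
ioProcess :: Context -> Int -> Chan Event -> MVar [String] -> MVar () -> IO ()
ioProcess ctx pid events logVar done = loop
  where
    loop = do
      ev <- readChan events
      case ev of
        Key c -> do
          modifyMVar_ logVar (return . ((show pid ++ ":" ++ [c]) :))
          loop
        Close -> do
          -- Leave the administration, then end the loop.
          modifyMVar_ ctx (return . filter ((/= pid) . fst))
          putMVar done ()

-- Register a new process in the context and start its thread.
startProcess :: Context -> MVar [String] -> Int -> IO (MVar ())
startProcess ctx logVar pid = do
  events <- newChan
  done   <- newEmptyMVar
  modifyMVar_ ctx (return . ((pid, events) :))
  _ <- forkIO (ioProcess ctx pid events logVar done)
  return done

-- Dispatch an abstract event to its process, if that process still exists.
dispatch :: Context -> Int -> Event -> IO ()
dispatch ctx pid ev = do
  procs <- readMVar ctx
  maybe (return ()) (`writeChan` ev) (lookup pid procs)

main :: IO ()
main = do
  ctx    <- newMVar []
  logVar <- newMVar []
  dones  <- mapM (startProcess ctx logVar) [1, 2]
  mapM_ (\(pid, ev) -> dispatch ctx pid ev)
        [(1, Key 'a'), (2, Key 'b'), (1, Close), (2, Close)]
  mapM_ takeMVar dones               -- wait until both processes have closed
  remaining <- readMVar ctx
  entries   <- readMVar logVar
  print (length remaining, sort entries)
```

Creating another interactive process amounts to one more startProcess call, which is the scalability argument made above.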

4.3 Message Passing 

In the concurrent implementation as described above, Channels are the obvious
Concurrent Haskell medium to implement message passing. Recall that a receiver
that handles messages of some type msg is unambiguously identified by a receiver
identification value of type (Rid msg). Its implementation is

data Rid msg = Rid { rid::Int, ridIn::(Chan msg) } 

The rid field is a fresh value to uniquely identify the receiver (inherited
from the original implementation). The ridIn field is new. It is a Channel that
implements the message queue.

Messages are sent with the function asyncSend :: Rid msg -> msg -> GUI
SendReport. After the usual correctness checks it places the message in the
message queue using writeChan. The receiver is notified that a message is
available in the message queue by inserting a pseudo OS event in the abstract
event stream environment (which is part of the shared Context). This pseudo OS
event is mapped to an abstract event which is dispatched to the parent I/O
process of the receiver. The receiver will eventually remove the message from
its message queue (using readChan) and handle the appropriate callback action.
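
The channel-based message queue can be sketched as follows. This is an approximation under stated assumptions: IO replaces the GUI monad, SendReport is reduced to a single hypothetical constructor SendOk, the correctness checks and the pseudo OS event notification are left out, and receiveNext is an illustrative name rather than a library function.

```haskell
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)

data Message = NewLine String | Quit deriving (Eq, Show)

-- Receiver identification as given in the text: a fresh number plus the
-- channel that implements the message queue.
data Rid msg = Rid { rid :: Int, ridIn :: Chan msg }

data SendReport = SendOk deriving (Eq, Show)   -- other outcomes omitted

-- Asynchronous send: enqueue the message and return immediately.
asyncSend :: Rid msg -> msg -> IO SendReport
asyncSend r msg = do
  writeChan (ridIn r) msg
  return SendOk

-- What the receiver does once it has been notified: dequeue one message.
receiveNext :: Rid msg -> IO msg
receiveNext = readChan . ridIn

main :: IO ()
main = do
  ch <- newChan
  let you = Rid { rid = 1, ridIn = ch }
  _  <- asyncSend you (NewLine "hello")
  _  <- asyncSend you Quit
  m1 <- receiveNext you
  m2 <- receiveNext you
  print [m1, m2]
```

Because a Chan is a FIFO queue, messages from one sender arrive in the order in which they were sent.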

5 The OS Dependent Layer 

In this project we have reused the existing C code completely, so we have
integrated these C modules into the Haskell implementation. For Haskell 98 the
Foreign Function Interface has been proposed as a way to write Haskell
code that calls upon foreign functionality. We illustrate how this has been done
by means of the following C procedure, defined in the module cpicture.c:

extern void WinGetStringWidth
    (CLEAN_STRING, CLEAN_STRING, int, int, int, HDC, OS, int*, OS*);

CLEAN_STRINGs point to structs of a length field (int) and a buffer of chars
of the given length. The types HDC and OS are also integers. If a Clean function
returns a tuple of results, then these are passed by the C procedure by means of
pointer values (int* and OS* respectively). The C procedure returns void. If a
Clean function returns one value then the C procedure also returns that value.
The Clean code looks as follows (module pictCCall_12.icl):

WinGetStringWidth :: !{#Char} !Fnt !Int !HDC !*OS -> (!Int,!*OS)
WinGetStringWidth _ _ _ _ _
    = code { .inline WinGetStringWidth
                 ccall WinGetStringWidth "SSIIIII-II"
             .end
           }

In the Haskell implementation we need to add marshalling code to convert 
the Haskell arguments to the arguments as required by the C procedures. For 
all functions we follow the same scheme. Here is the Haskell code: 

winGetStringWidth :: String -> Fnt -> Int -> HDC -> IO Int
winGetStringWidth a1 (a2,a3,a4) a5 a6
  = do s1 <- createCLEAN_STRING a1
       s2 <- createCLEAN_STRING a2
       o1 <- malloc 4
       o2 <- malloc 4
       cWinGetStringWidth s1 s2 a3 a4 a5 a6 osNewToolbox o1 o2
       mapM_ freeCLEAN_STRING [s1,s2]
       r1 <- fpeek o1
       r2 <- free o2
       return r1

foreign import stdcall "cpicture" "WinGetStringWidth"
  cWinGetStringWidth :: Addr -> Addr -> Int -> Int -> Int -> HDC -> Int
                     -> Addr -> Addr -> IO ()

Haskell Strings are lists of characters. These are converted to CLEAN_STRINGs
using the function createCLEAN_STRING. This function has been implemented in
Haskell, using the GHC language extension modules Addr, Bits, and Storable.
For all output arguments memory is allocated, using malloc. This is probably
extremely inefficient, but at the time of writing our primary interest was
correctness.

When all arguments have been created the C procedure can be called. The
connection is made with the foreign import statement which identifies the C
module and procedure name. As you can see, the type of the function closely
follows that of the C procedure given above. After evaluation, the allocated
memory needs to be freed. This is done by freeCLEAN_STRING for strings, and by
fpeek, which reads the value stored at its argument, frees the argument, and
returns the value.
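
The same marshalling pattern (allocate a cell for each output argument, call the C procedure, read the result, release the cell) can be written with the present-day standard FFI. The sketch below is not the Object I/O code: it assumes a modern GHC, calls the C library function frexp from math.h as a stand-in for a procedure with an output argument, and uses alloca, which releases the cell automatically, where the text uses malloc and fpeek.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
import Foreign.C.Types (CDouble (..), CInt (..))
import Foreign.Marshal.Alloc (alloca)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peek)

-- double frexp(double x, int *exp): returns the mantissa and writes the
-- exponent through the pointer, so it has one result and one output argument.
foreign import ccall unsafe "math.h frexp"
  c_frexp :: CDouble -> Ptr CInt -> IO CDouble

-- Haskell wrapper: marshal the argument, provide a temporary cell for the
-- output argument, and convert both results back to Haskell types.
frexp :: Double -> IO (Double, Int)
frexp x =
  alloca $ \expPtr -> do
    m <- c_frexp (realToFrac x) expPtr
    e <- peek expPtr               -- read the output before the cell is released
    return (realToFrac m, fromIntegral e)

main :: IO ()
main = frexp 8.0 >>= print         -- 8.0 = 0.5 * 2^4
```

Using alloca avoids a heap allocation per output argument, which addresses the inefficiency noted above.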

6 Experience 

This project was carried out by an experienced Clean programmer and an
experienced Haskell programmer. This was a good occasion to compare the two
languages. Clean and Haskell are clearly cousins. Proficient Clean (Haskell)
programmers can master Haskell (Clean) easily. Still, each language has its
advantages over the other. Clean's language advantages are:







Records: Haskell field labels are fairly equivalent to Clean records. Haskell field 
labels automatically convert to field selector functions. Therefore one can’t 
existentially quantify these fields because the field selector functions become 
ill-typed. In Clean, a record is basically an algebraic data type with one data 
constructor. The fields identify the arguments of the constructor. The normal 
type rules apply, including those for existential types. Finally, Clean allows 
the same record field name to occur in several record types. These expressions 
can always be disambiguated by either a unique combination of field names 
or by adding the type constructor name inside the record notation. As an 
illustration, the following definitions are valid Clean (in Haskell syntax - 
note the absence of a data constructor): 



data R1 = {a::Int,  b::Bool}
data R2 = {a::Bool, c::Real}
                          -- Inferred types:
f1 {a,b}  = (a,b)         -- f1 :: R1 -> (Int,Bool)
f2 {a,c}  = (a,c)         -- f2 :: R2 -> (Bool,Real)
f3 {R1|a} = a             -- f3 :: R1 -> Int
f4 {R2|a} = a             -- f4 :: R2 -> Bool



Macros: although in GHC one can use the C preprocessor, one cannot export
macros, which limits their use. Constant functions don’t help because you
can’t use them in pattern matches.

Strictness annotations. Strictness is a well-established concept in Clean. One
can annotate data types, function argument types, and local definitions as
strict. This gives Clean programmers fine-grained control over evaluation order.

Type constructor operators. In Clean and Haskell type and data constructors
have separate name spaces. In Clean these spaces contain the same
range of symbols. This allows all Clean Object I/O library GUI
type definitions (including :+:) to have identical type and data constructor
names everywhere.

Module structure. The Clean module system distinguishes implementation
and definition modules. This basically means that Clean programmers write
their own .hi files. Definition modules are not allowed to be cyclically
dependent, but implementation modules are. Compiling a project therefore
involves compiling a tree of modules.

Haskell language advantages are: 

Field labels are also selector functions. As a programmer you do not have to 
write your own access functions. This results in elegant code. 

Derived instances help programmers avoid writing code that can be derived 
by the compiler. 

Rank 2 polymorphism. Clean and GHC Haskell support existential types.
However, GHC makes this language feature complete by extending the
type system with rank-2 polymorphism. This extension allows one to write
higher order functions on existentially quantified data structures.

Cyclic modules. Because Haskell modules consist of one file with an interface
header (the export list), modules must be allowed to be cyclically dependent.
This increases expressiveness when compared with Clean. It has allowed us
to parameterise all GUI types uniformly with respect to the state arguments.
(Unfortunately, GHC can’t deal with cyclic Haskell modules without help
from the programmer, who is forced to write .hi-boot files.)

Monads and do-notation handle the environment passing part of interactive
programs. The resulting code is less cluttered with environments when
compared to equivalent Clean code.
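
The interplay of existential types and rank-2 polymorphism mentioned in the list above can be shown in a few lines. This is an illustrative sketch for a present-day GHC; the Element type and the mapElement function are hypothetical and do not come from the Object I/O sources.

```haskell
{-# LANGUAGE ExistentialQuantification, RankNTypes #-}

-- An existentially quantified element: the local state type s is hidden.
data Element = forall s. Show s => Element String s

-- A higher order function over such elements. The function argument must
-- work for every hidden state type, which is what the rank-2 type expresses.
mapElement :: (forall s. Show s => String -> s -> r) -> [Element] -> [r]
mapElement f es = [ applyTo e | e <- es ]
  where
    applyTo (Element name st) = f name st

describe :: [Element] -> [String]
describe = mapElement (\name st -> name ++ " = " ++ show st)

main :: IO ()
main = mapM_ putStrLn
  (describe [Element "count" (0 :: Int), Element "title" "talk"])
```

Without the rank-2 type, each traversal of the element list would have to be written out separately, which is the code duplication the text attributes to the Clean version.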

The mini Haskell Object I/O library implements only a part of the Clean
Object I/O library; see the first table below. The second table shows the sizes
of the Haskell mini Object I/O library with local state (LS) and mutable
variables (MVAR). Note that because we have used a straightforward translation
from Clean to Haskell, the OS independent layers are virtually identical in
size. Clean does not support rank 2 polymorphism, which leads to significant
duplication of code that handles existentially quantified data structures. The
Haskell version could take advantage of this. The Haskell OS dependent layer is
about twice the size of the Clean version due to marshalling (Section 5). The
shared C implementation consists of 39 *.c, *.h modules, and 13132 loc.



Clean       #mod. (%total)   #loc (%total)
OS indep.   50 (45.9%)       8956 (30.5%)
OS dep.     20 (54.0%)       3961 (53.9%)

Haskell     #mod.   #loc LS   #loc MVAR
OS indep.   54      8583      8202
OS dep.     21      6393      6393



When writing large applications or libraries it becomes increasingly
important to have dedicated development tools. From the very start Clean versions
have been released together with integrated compiler/editor environments. In
contrast, GHC is a command-line based, Unix oriented system. It does not
come with any development environment. Instead, one is supposed to use
makefiles. For programmers used to GUI based IDEs this is a rude awakening.

A final important factor with increased application sizes is the quality of 
the compiler. GHC’s error messages are more informative than Clean’s, espe- 
cially when concerned with the type system. The Clean compiler is significantly 
faster than GHC. Compilation time measurements conducted on the Object I/O 
libraries indicate that Clean compiles 10 to 20 times as fast as GHC. 

7 Related Work 

The main purpose of this project was to study the relationship between the Clean
Object I/O library and Haskell, and to see if and how it can be implemented in a
monadic framework. Except for the new approach (from a Clean point of view)
to handle (local) state using MVars, the Object I/O library has not changed. This
is the reason why we give only a very brief comparison with related work.
(Related work with respect to the ‘original’ Object I/O library is discussed
elsewhere.)

We have shown how the Object I/O library can be simplified using mutable
variables. Mutable variables are used in several functional GUI systems
(TkGofer, Pidgets, tclHaskell; [26, 23, 22] respectively). Fudgets and gadgets
use stream communication to model global state, and recursive functions
for local state. The latter technique is used in Oval [13] to model global
state. Even if (local) state is handled by means of mutable variables, the Object
I/O library differs because of its emphasis on defining GUI objects by means of
algebraic data structures. This allows one to define functions that pass around
GUI specifications that can be manipulated and changed before they are actually
created. In all other systems GUI objects can only be created using actions.

We have demonstrated that it is possible to have a concurrent implementation
of a functional GUI system without sacrificing the deterministic semantics
of interactive processes. The use of Concurrent Haskell primitives results in a
simpler and shorter implementation of the library, without changing its semantic
properties. This contrasts strongly with the general opinion that one has to use
a concurrent functional language to construct GUI programs flexibly [123113].



8 Conclusions and Future Work 

In this project we have successfully ported an essential subset¹ of the Clean
Object I/O library to Haskell. We have argued that the Object I/O library is
really independent of the underlying paradigm to integrate side effects in pure
functional languages. Callbacks are modeled in Clean using explicit environment
passing functions, while they are represented as monads in Haskell. In this way
we preserve the best properties in both languages.

The mini Object I/O library covers 42% of the whole Object I/O library. We
have focussed on the ‘hard bits’. All crucial design and implementation issues
have been solved. Due to lack of resources we have not been able to check
uni- and bi-directional synchronous message passing and the construction of a
drawing monad (which should encapsulate a Picture environment). Except for these
parts, porting the rest of the code should not prove to be difficult.

We have used GHC 4.08.1. It implements Haskell 98 and extends it with
several language and library features. This raises the question of whether this
project has become a GHC project rather than a Haskell 98 project. The Object
I/O library cannot be implemented straight away in Haskell 98. To implement the
‘pure’ Object I/O library one needs to extend Haskell 98 with existential types.
The rest of the library can be obtained by translation from Clean to Haskell. To
implement the MVar state version existential types are not required. The only
extensions needed are MVars. These can be added to any Haskell compiler as a
separate library, though they also require significant runtime support.

We think that it is worthwhile for the Haskell and Clean community to
complete this project for several reasons: (a) the Haskell community will obtain
a GUI library that has proven itself in practice, (b) the library can, because of
its internal architecture, be easily ported to more traditional Haskell platforms,
(c) it will encourage code sharing between Haskell and Clean. The existence
of a GUI library that is both easily portable and language independent will
strengthen the position of functional languages in the long term.

¹ Available in the GHC CVS repository at fptools/hslibs/object-io

Abstract. Speculative evaluation relates to computing several (alternative)
threads of control of large programs concurrently without knowing
in advance which of them contribute to which extent to final results.
This approach may be used to advantage to compute, at the expense
of deploying considerable processing power, solutions of NP-hard search
problems on average a lot faster than sequentially.

This paper addresses the organizational measures necessary to perform 
speculative computations concurrently in a distributed memory multi- 
processor system. They primarily concern task management and schedul- 
ing, a fairness regulation scheme which ensures progress of all specula- 
tive tasks at about the same pace, and the conflict between fairness and 
bounded numbers of speculative tasks. Though these measures are dis- 
cussed in the context of functional languages and systems, they are in 
principle applicable in the imperative world as well. 



1 Introduction 

Programs of functional languages are known to be perfectly suited for concurrent 
processing. Conceptually, program execution is a process of meaning-preserving 
program transformations based on a set of rewrite rules which for all semanti- 
cally meaningful programs eventually terminates with a result which is itself a 
program. Since all rewrite rules perform context-free substitutions of equals by 
equals, creating no side effects elsewhere in the program, they may be applied 
in any order without affecting the determinacy of results. 

Of the various concepts of executing functional (or function-based) programs 
concurrently, speculative evaluation is the most challenging one from an orga- 
nizational point of view. However, at the expense of committing considerable 
resources, it may also be the least rewarding one in terms of performance gains. 
The idea is to evaluate several sub-terms of a functional program concurrently 
without having, at the time the respective tasks or threads of control are being 
created, sufficient information at hand to decide which of them may contribute 
to which extent to the normal form of the entire program eventually. 

Under a lazy regime, speculative evaluation is often employed to evaluate
some or all function arguments in advance and possibly beyond the point actually
needed, provided sufficient processing power can be made available that would
otherwise be idling (though this so-called eager-beaver approach runs somewhat
counter to the idea of laziness since more than absolutely necessary is usually
done to compute normal forms) [Jon87, Par91, Mat93, Che94].

M. Mohnen and P. Koopman (Eds.): IFL 2000, LNCS 2011, pp. 214-230, 2001.

This paper is on the speculative evaluation of sets of rewrite rules, speci- 
fied as pattern matching clauses embedded in case (or switch) constructs. Pat- 
tern matching, generally speaking, is to abstract specific (sub-)terms from given 
structural contexts (the argument terms to which the cases are applied) and to 
substitute them into specific syntactical positions of other contexts (the body 
terms of the matching clauses). As several patterns of a case may be overlap- 
ping, they may produce as many matches on given arguments. Under a purely 
functional interpretation, the patterns are applied in the order in which they 
are specified in the case, and the first matching clause is the one that is be- 
ing picked as the result of the entire case application. In compliance with this 
execution order, the clauses are usually arranged so that the patterns covering 
special argument features precede those that cover the more general (structural) 
features. 

However, rule based applications such as term rewriting (logic reasoning) or 
many search problems are typically of a nature where several clauses feature 
overlapping patterns which can not be given a unique ordering. If more than 
one of these patterns matches a particular argument, the respective clauses may 
have to be evaluated speculatively since sufficient information as to which of 
them will lead to some desired result (problem solution) eventually may become 
available only further down the road as more pattern clauses (rules) are being 
applied. Unfortunately, such trial-and-error computations generally feature an 
exponential complexity which renders them intractable for large problem sizes, 
particularly when doing them sequentially. However, evaluating all matching
clauses of case applications concurrently on a speculative basis may considerably
improve the chances of computing solutions of such problems decidedly faster,
provided a generous supply of processing power is available.

Of primary concern in this paper are the organizational measures necessary 
to support in some orderly form this kind of speculative concurrency in a mul- 
tiprocessor system both effectively and efficiently. To do so, the system must 

— distinguish between vital tasks (or threads of control) whose results are bound 
to contribute to problem solutions and speculative tasks (threads) whose 
results may or may not be required; 

— treat all vital tasks with higher priority than speculative tasks, and possibly 
distinguish several priority levels among the latter, to prevent the monop- 
olization of the system with computations of which most are known to be 
superfluous; 

— abort speculative tasks or lift their status to vital as soon as decisions to 
this effect can be made; 

— apply a fair scheduling discipline to all speculative tasks originating from
the same case application to ensure that all of them proceed at about the
same pace as long as no clues are available as to which of them have the best
chances of succeeding; otherwise too many resources may be committed to
just the wrong computations while those that produce useful solutions are
left starving;

— strike an acceptable compromise between limitation of resources, specifically
processing power, on the one hand and fair progress of potentially unbounded
numbers of speculative tasks on the other hand.
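
Two of these requirements, priority for vital tasks and equal pace for speculative siblings, can be illustrated with a toy scheduler. This is a sequential simulation under simplifying assumptions, not the scheme implemented in the system discussed in this paper: the Task record and the schedule function are hypothetical, each task is reduced to a count of remaining time slices, and tasks are never aborted or promoted.

```haskell
data Status = Vital | Speculative deriving (Eq, Show)

data Task = Task { name :: String, status :: Status, stepsLeft :: Int }

-- One scheduling round: every unfinished vital task gets a time slice first,
-- then every unfinished speculative task gets exactly one slice, so that
-- speculative siblings advance at the same pace. The result is the order in
-- which slices are handed out.
schedule :: [Task] -> [String]
schedule tasks
  | null live = []
  | otherwise = map name ordered ++ schedule (map step ordered)
  where
    live    = filter ((> 0) . stepsLeft) tasks
    ordered = filter ((== Vital) . status) live
           ++ filter ((== Speculative) . status) live
    step t  = t { stepsLeft = stepsLeft t - 1 }

main :: IO ()
main = print (schedule
  [ Task "s1" Speculative 2, Task "v" Vital 1, Task "s2" Speculative 2 ])
```

A real system additionally has to bound the number of speculative tasks, which is where the conflict with fairness named in the last item arises.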

The paper discusses conceptual solutions for these organizational problems
which have been successfully implemented as extensions of an existing concurrent
graph reduction system π-red [Klu83, HHK94, GK96]. This implementation has
been extensively tested and validated by means of a parameterizable program
which simulates searches of a spider for exits in a maze.

The paper is organized as follows: the next section introduces some essential
language constructs. Section 3 outlines the basic principles of organizing
speculative computations. Section 4 describes the measures that sustain
reasonable fairness while limiting the number of speculative tasks. Section 5
reports on some performance results, and Section 6 discusses some related work.

2 The Language 

For the purpose of this paper it suffices to consider a simple dynamically typed
and strict functional kernel language with a reduction semantics [Berk75, Klu94],
i.e., program execution is governed by a set of term rewrite rules. The program
terms are recursively constructed as follows:

e   = const | var | prim_fun
    | ( e_0 e_1 ... e_n )
    | IF e_0 THEN e_1 ELSE e_2
    | LET u_1 = e_1 ... u_n = e_n IN e_0
    | < e_1, ..., e_n >
    | DEFINE ..., f u_1 ... u_n = e_f, ... IN e_0
    | CASE pat_1 -> e_1, ..., pat_i -> e_i, ..., pat_n -> e_n end_CASE
pat = const | var | _ * pat- * | < ..., pat_k, ... >

These terms denote, from top to bottom, constant values, variables, primitive
functions such as +, -, ..., gt, le, etc., applications of terms e_0 in function
position to n argument terms e_1, ..., e_n, if_then_else clauses, let terms,
n-tuples (n-ary lists) of terms, and sets of mutually recursive function
definitions, with f, u_1, ..., u_n and e_f respectively denoting a function
identifier, the formal parameters of the function, and the function body term
(which computes the function value). The term e_0, which may call upon any of
the functions defined, computes the value of the entire define construct.

The terms of primary interest in this paper are CASE constructs of some n 
pattern matching clauses pat_i -> e_i which may be used to define sets of rewrite 
rules for tuple terms. The patterns may be composed of constants, variables, 
wild card patterns (denoted as _ * pat * _), and tuples of these items, including 
tuples recursively nested as tuple components. 

A tuple pattern pat is said to match an argument term if it has an identical 
tuple structure, each pattern constant literally equals a constant value in an 
identical tuple position of the argument, and each pattern variable matches a 
(sub-)structure in an identical argument position¹. This being the case, each 
occurrence of a pattern variable in the body term e_i is substituted by the 
respective argument component, and the term thus instantiated is evaluated. 

The entire CASE-construct in fact specifies a complex unary function. When 
applied to an argument value, a strictly functional interpretation requires that all 
clauses be tried in the order from left to right, and the value of the first clause 
whose pattern matches be returned as function value. This evaluation order 
guarantees determinacy of results even in the presence of several (potentially) 
matching patterns. 

If none of the patterns matches, the CASE application may simply return 
itself as its own value since it can obviously not be rewritten into anything else. 
Alternatively, the application may be considered undefined and be replaced by 
the bottom symbol ⊥. 

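Thus, the meaning (or the semantics) of a CASE application may be defined 
by an evaluator function EVAL as: 

\[
\mathrm{EVAL}[\,(\ \mathrm{CASE}\ \ldots,\ pat_i \rightarrow e_i,\ \ldots\ \mathrm{END\_CASE}\ \ e_a\ )\,]
=
\begin{cases}
\mathrm{EVAL}[\,e_i[\Leftarrow]\,]
  & \text{if } \mathrm{MATCH}(pat_i, e_a) = \mathrm{SUCC} \\
  & \text{and } (\forall j \in \{1, \ldots, i-1\})\ \ \mathrm{MATCH}(pat_j, e_a) = \mathrm{FAIL} \\[4pt]
(\ \mathrm{CASE}\ \ldots,\ pat_i \rightarrow e_i,\ \ldots\ \mathrm{END\_CASE}\ \ e_a\ )\ \ (\text{or } \bot)
  & \text{otherwise}
\end{cases}
\]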



where MATCH(pat_i, e_a) returns SUCC (for succeed) if pat_i matches the
argument term e_a, and FAIL otherwise²; e_i[⇐] denotes the instantiation of
occurrences of the pattern variables in the body term e_i by the (sub-)terms extracted 
from the matching positions of the argument term e_a. 
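The first-match semantics can be illustrated with a small sketch. It is written in Python rather than in the kernel language, and the names `Var`, `match`, `eval_case` and `BOTTOM` are introduced here for illustration only. Wild card patterns and repeated pattern variables are deliberately left out, and clause bodies are modelled as Python functions over the variable bindings.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

BOTTOM = "BOTTOM"  # stand-in for the bottom symbol (an assumption of this sketch)


class Var:
    """A pattern variable; it matches any argument (sub-)term."""

    def __init__(self, name: str) -> None:
        self.name = name


Bindings = Dict[str, Any]
Clause = Tuple[Any, Callable[[Bindings], Any]]  # (pattern, body)


def match(pat: Any, arg: Any) -> Optional[Bindings]:
    """Return the variable bindings on SUCC, or None on FAIL."""
    if isinstance(pat, Var):
        return {pat.name: arg}
    if isinstance(pat, tuple):
        # A tuple pattern needs an identical tuple structure in the argument.
        if not isinstance(arg, tuple) or len(pat) != len(arg):
            return None
        bindings: Bindings = {}
        for p, a in zip(pat, arg):
            sub = match(p, a)
            if sub is None:
                return None
            bindings.update(sub)
        return bindings
    # A constant pattern must literally equal the argument.
    return {} if pat == arg else None


def eval_case(clauses: List[Clause], arg: Any) -> Any:
    """Try the clauses from left to right; the first matching clause wins."""
    for pat, body in clauses:
        bindings = match(pat, arg)
        if bindings is not None:
            return body(bindings)  # instantiate and evaluate the body
    return BOTTOM  # no pattern matches


clauses: List[Clause] = [
    ((1, Var("x")), lambda b: ("first", b["x"])),
    ((Var("a"), Var("b")), lambda b: ("second", b["a"], b["b"])),
]
# Both patterns match (1, 2); the leftmost clause determines the result.
print(eval_case(clauses, (1, 2)))
```

Returning `BOTTOM` corresponds to the second of the two alternatives mentioned above; returning the unevaluated application would be the other.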

However, there are many interesting applications, specifically search
problems, where clauses (rewrite rules) with overlapping patterns cannot be given a 
unique ordering with respect to a best possible choice of a solution. There may 
be several promising alternatives to pursue, and the choice may have to be made 
further down the road as more information to this effect becomes available, say 
by repeated application of the same or other CASEs, and some of the alternatives 
can safely be discarded (e.g., those that produce a fail). Until such choices can 
be made, evaluating matching clauses remains inevitably speculative as it is not 
known a priori which of them will eventually fail or succeed. However, to speed 
things up, several or all matching clauses may be computed concurrently since, 
in a functional setting, they do not inflict side effects on each other. 

¹ Trivial constant patterns match identical constant arguments, and trivial variable 
patterns match all legitimate argument terms. Wild card patterns _ * pat * _ may 
only occur as components of tuple patterns. They match sequences of ≥ 1 tuple 
components in the arguments, of which one must match the pattern pat. If the wild 
card _ preceding or succeeding pat is missing, then pat must match the first or the 
last component, respectively, of the matching sequence. 

² Note that both SUCC and FAIL are not elements of the functional language proper 
but are values of the function MATCH used by the evaluator EVAL. 

Denoting a CASE construct as S_CASE if it is to be evaluated speculatively, 
the meaning of an S_CASE application may be defined as: 

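\[
\mathrm{EVAL}[\,(\ \mathrm{S\_CASE}\ \ldots,\ pat_i \rightarrow e_i,\ \ldots\ \mathrm{END\_CASE}\ \ e_a\ )\,]
=
\begin{cases}
(\ \mathrm{S\_CASE}\ \ldots,\ pat_i \rightarrow e_i,\ \ldots\ \mathrm{END\_CASE}\ \ e_a\ )\ \ (\text{or } \bot)
  & \text{if } \forall i \in \{1, \ldots, n\}\ \ \mathrm{MATCH}(pat_i, e_a) = \mathrm{FAIL} \\[4pt]
\{\ \mathrm{EVAL}[\,e_i[\Leftarrow]\,]\ \mid\ \mathrm{MATCH}(pat_i, e_a) = \mathrm{SUCC}\ \}
  & \text{otherwise}
\end{cases}
\]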



i.e., the value of such an application, again, is the application itself (or the 
bottom symbol ⊥) if none of the patterns matches, and the set of values of the 
instantiated body terms of all clauses whose patterns match the argument value 
e_a otherwise. 
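A minimal sketch of this set-valued semantics follows. It is written in Python, and all names (`eval_s_case`, `clause_for`, `BOTTOM`) are hypothetical. Each clause is modelled as a pair of a matcher, which returns bindings or `None` for FAIL, and a body. The sketch returns a list in clause order instead of a set, and it evaluates the matching clauses one after the other, so it shows only what is computed, not the concurrent evaluation.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Bindings = Dict[str, Any]
Matcher = Callable[[Any], Optional[Bindings]]  # None stands for FAIL
Clause = Tuple[Matcher, Callable[[Bindings], Any]]

BOTTOM = "BOTTOM"  # stand-in for the bottom symbol (an assumption of this sketch)


def eval_s_case(clauses: List[Clause], arg: Any) -> Any:
    """Collect the values of all clauses whose patterns match the argument."""
    results = []
    for matcher, body in clauses:
        bindings = matcher(arg)
        if bindings is not None:
            results.append(body(bindings))
    return results if results else BOTTOM


def clause_for(direction: str) -> Clause:
    """A clause that matches any tuple containing the given direction."""

    def matcher(hs: Any) -> Optional[Bindings]:
        return {} if direction in hs else None

    return matcher, lambda bindings: "go " + direction


clauses = [clause_for(d) for d in ("east", "west", "north", "south")]
# Two of the four patterns match, so two values are produced.
print(eval_s_case(clauses, ("west", "north")))
```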

Based on this definition, it may be further qualified what a (best possible) 
problem solution (result) that is to be chosen from this set should be. There are 
basically three options available: 

— One could be content with just one of possibly several results, e.g., the 
one returned first. However, as such a result may depend on execution orders 
chosen by the underlying system, it is generally non-determinate and thus 
violates the functional semantics. 

— To guarantee determinacy, the criterion for selecting a single out of sev- 
eral possible results has to be made dependent upon an algorithmic or 
application-specific property. Such a criterion could be the least number 
of rule applications performed to arrive at a result, and if there is more than 
one result that meets it, then one could pick the leftmost of the particular 
S_CASE. 

— Another alternative would be to ask for the full set of solutions. Unfortu- 
nately, it cannot generally be decided by the system whether this set can be 
computed at all since the computation may not terminate in all branches. 
However, the user could specify an upper bound on the number of rule appli- 
cations to be executed in each speculative branch, and accept as a solution 
the subset of results that can be computed within this limit. 

Selecting one of these options may be specified by means of distinct
keywords N_CASE (for non-determinate single solutions), D_CASE (for determinate 
single solutions), and M_CASE (for multiple solutions), which control appropriate 
interpretation or compilation to machine code. 

The need to count rule applications in the latter two cases goes hand in hand 
with the need to halt in some orderly form runaway computations effected, say, 
by recursive function calls that fail to meet termination conditions. This may be 
accomplished by a system-supported count variable which prior to each program 
run is initialized with some user-specified integer value. Upon each application 
of a rewrite rule this count value is decremented by one. The computation is 
halted either if the program term has reached normal form or if the count value 
is down to zero, whichever occurs first [Berk75, GK96]. 

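This count variable may be smoothly integrated into the definition of the 
evaluator EVAL, which for S_CASEs then takes the form 

\[
\mathrm{EVAL}[\,(\ \mathrm{S\_CASE}\ \ldots,\ pat_i \rightarrow e_i,\ \ldots\ \mathrm{END\_CASE}\ \ e_a\ )\ \mid\ k\,]
=
\begin{cases}
(\ \mathrm{S\_CASE}\ \ldots,\ pat_i \rightarrow e_i,\ \ldots\ \mathrm{END\_CASE}\ \ e_a\ )\ \mid\ k\ \ (\text{or } \bot)
  & \text{if } \forall i \in \{1, \ldots, n\}\ \ \mathrm{MATCH}(pat_i, e_a) = \mathrm{FAIL} \\
  & \text{or if } k = 0; \\[4pt]
\{\ \mathrm{EVAL}[\,e_i[\Leftarrow]\ \mid\ k-1\,]\ \mid\ \mathrm{MATCH}(pat_i, e_a) = \mathrm{SUCC}\ \}
  & \text{otherwise}
\end{cases}
\]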



Assuming that the count variable has value k upon entering the evaluation of an 
S_CASE application, the application remains unchanged (or is replaced by the
bottom symbol ⊥³) if none of the patterns matches the argument or k is down to
zero at this point; otherwise the count is decremented by one, whereupon the 
body terms of all matching clauses continue computing with their own copies of 
the count value k - 1. 

Using these count values, the value (or normal form) of 

— a D_CASE application can be defined as the value of the clause selected by a 
successful pattern match whose associated count variable k has the largest 
remaining value (and which thus required the least number of rule applications); 

— an M_CASE application can be defined as the set of values of those clauses 
selected by successful pattern matches whose associated count variables have 
values k > 0. 
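The role of the count values in selecting results can be sketched as follows. This is a Python illustration under simplifying assumptions, not the system's implementation. A rewrite system is modelled by a hypothetical `step` function that returns the rewritten term or `None` once the term is in normal form, the bodies of the matching clauses are given directly as terms, and a branch that reaches normal form only as its count drops to zero is treated as exhausted, in line with the condition k > 0.

```python
from typing import Any, Callable, List, Optional, Tuple

Step = Callable[[Any], Optional[Any]]  # rewritten term, or None in normal form
Result = Optional[Tuple[Any, int]]     # (value, remaining count), or None


def run(step: Step, term: Any, k: int) -> Result:
    """Rewrite until normal form; give up once the count is down to zero."""
    while k > 0:
        nxt = step(term)
        if nxt is None:
            return term, k
        term, k = nxt, k - 1
    return None


def speculate(bodies: List[Any], step: Step, k: int) -> List[Result]:
    """Evaluate all matching clause bodies, each with its own copy of k - 1."""
    if k == 0:
        return []
    return [run(step, body, k - 1) for body in bodies]


def d_case(bodies: List[Any], step: Step, k: int) -> Optional[Any]:
    """Value with the largest remaining count; the leftmost one wins on ties."""
    best: Result = None
    for result in speculate(bodies, step, k):  # left to right
        if result is not None and (best is None or result[1] > best[1]):
            best = result
    return None if best is None else best[0]


def m_case(bodies: List[Any], step: Step, k: int) -> List[Any]:
    """All values that were computed before the count ran out."""
    return [r[0] for r in speculate(bodies, step, k) if r is not None]


def countdown(term: Tuple[int, str]) -> Optional[Tuple[int, str]]:
    """Toy rewrite rule: decrement the first component until it is zero."""
    n, label = term
    return (n - 1, label) if n > 0 else None


bodies = [(3, "a"), (1, "b"), (1, "c"), (50, "d")]
print(d_case(bodies, countdown, 10))
print(m_case(bodies, countdown, 10))
```

With a count of 10, branch "d" runs out of fuel, "b" and "c" finish with the same remaining count, and the D_CASE-style selection keeps the leftmost of the two.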



3 Controlling Speculative Concurrent Computations 

Executing functional programs concurrently is usually based on a simple
divide-and-conquer scheme which recursively spawns new tasks for program terms 
whose values are bound to contribute to results. The tasks at the leaves of 
the emerging task tree may be scheduled for processing in any order and
non-preemptively since all of them are vital (or mandatory). There are also fairly 
simple measures at hand to prevent the creation of tasks far beyond the number 
of processing sites [Klu83]. 

³ Note that the bottom symbol is used in the actual system implementation to easily 
identify speculative computations that have failed to produce a (partial) problem 
solution and therefore can be aborted. 

Things become decidedly more complicated when speculative tasks enter the 
game. To commit processing capacity with top priority to useful computations, 
vital tasks must be given preference over speculative tasks, i.e., no speculative 
task may be scheduled for processing as long as there are vital tasks ready to 
run. Moreover, vital tasks, once running, may yield processing sites only if they 
terminate or suspend themselves. Scheduling speculative tasks on the processing 
sites that are left over must be preemptive to prevent the monopolization of the 
system's resources by computations that may turn out to be useless. Processing 
sites may have to be turned over as soon as possible to newly emerging vital tasks, 
and all speculative tasks ought to be moved ahead at about the same pace as 
long as it cannot be decided which of them are going to succeed. If there are more 
speculative tasks than processing sites they can run on, they must inevitably be 
scheduled in a round robin fashion similar to time slicing to enforce fair progress 
(which in fact realizes a breadth-first evaluation strategy). 
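The scheduling rule for a single processing site can be sketched as follows. The class `Scheduler` and its methods are hypothetical, tasks are represented by plain names, and a speculative task is assumed to be preempted after every slice, which is modelled simply by putting it back at the end of its queue.

```python
from collections import deque
from typing import Deque, Optional


class Scheduler:
    """Vital tasks first; speculative tasks share what is left, round robin."""

    def __init__(self) -> None:
        self.vital: Deque[str] = deque()
        self.speculative: Deque[str] = deque()

    def add(self, task: str, vital: bool) -> None:
        (self.vital if vital else self.speculative).append(task)

    def next_task(self) -> Optional[str]:
        if self.vital:
            # A vital task runs non-preemptively, so it leaves the queue.
            return self.vital.popleft()
        if self.speculative:
            # A speculative task gets one slice and re-enters at the back.
            task = self.speculative.popleft()
            self.speculative.append(task)
            return task
        return None


s = Scheduler()
s.add("s1", vital=False)
s.add("s2", vital=False)
s.add("v1", vital=True)
print([s.next_task() for _ in range(4)])
```

The vital task is served first although it was added last; after that the two speculative tasks alternate.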

A suitable fairness regulation mechanism may be based on the notion of 
synchronic distances as introduced in Petri net theory [GLT80], in combination 
with some loose form of barrier synchronization. Synchronic distances, roughly 
speaking, define upper bounds on the number of operational steps by which 
one of several competing tasks (or threads of control) can at most get ahead 
of the others before they must catch up. These steps can easily be counted in 
terms of rewrite rules performed, which nicely blends in with the count variable 
k introduced in the preceding section as part of the definition of the evaluator 
EVAL, whose primary purpose is to halt runaway computations. 



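[Figure: progress of the speculative tasks t0 ... t4, measured in rewrite steps, relative to the barriers A, B and C; left: before the barriers are moved, right: after they have been moved and the old barrier A has been abandoned]

Fig. 1. Fairness regulation scheme with three-barrier synchronization 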



The fairness mechanism which, after experimenting with several alternative 
solutions, has emerged as the most effective one for supporting speculative
computations in a distributed implementation of π-RED involves three successive 
barriers A, B and C which are synchronic distances of some k rewrite steps 
away from each other (see Fig. 1). Assuming that some n speculative tasks 
(there are five, named t0 ... t4, shown in the figure) participating in the race have 
just managed to cross barrier A, they are supplied with a fuel of some k rewrite 
steps to reach barrier B. Tasks arriving at barrier B (and thus having exhausted 
their fuel) may cross and be refueled with another k rewrite steps to move on 
towards barrier C. There they are blocked until all n tasks have crossed barrier 
B (the left of Fig. 1). This being the case, the three barriers are moved k steps 
ahead, i.e., the current barrier A is abandoned, barriers B and C become the 
new barriers A and B, respectively, and a new barrier C is set up k steps down 
the road from the current one (the right of Fig. 1), and so on. 
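The effect of the three barriers can be checked with a small simulation. It is a sketch in Python under simplifying assumptions: all tasks live in one process, a randomly chosen runnable task performs one rewrite step at a time, and refueling is modelled implicitly by the barrier positions rather than by messages. The function name `simulate` and its parameters are introduced here for illustration only.

```python
import random


def simulate(n_tasks: int, k: int, total_steps: int, seed: int = 0) -> int:
    """Run tasks under three barriers; return the largest lead ever observed."""
    if n_tasks < 1 or k < 1:
        raise ValueError("need at least one task and a distance of at least 1")
    rng = random.Random(seed)
    done = [0] * n_tasks   # rewrite steps performed so far by each task
    barrier_a = 0          # barrier B is at A + k, barrier C at A + 2k
    max_lead = 0
    while min(done) < total_steps:
        barrier_c = barrier_a + 2 * k
        # Tasks that have arrived at barrier C are blocked.
        runnable = [t for t in range(n_tasks) if done[t] < barrier_c]
        task = rng.choice(runnable)
        done[task] += 1    # one rewrite step
        max_lead = max(max_lead, max(done) - min(done))
        if min(done) >= barrier_a + k:
            # All tasks have crossed B: A is abandoned, B and C become the
            # new A and B, and a new C is set up k steps further ahead.
            barrier_a += k
    return max_lead


print(simulate(n_tasks=5, k=10, total_steps=200))
```

In this model no task can get more than 2k steps ahead of the slowest one, because a task is blocked at C until the slowest task has crossed B.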

This three-barrier mechanism keeps many of the tasks alive and computing 
most of the time, particularly if the number of tasks only marginally exceeds the 
number of available processing sites. As it must exercise control over all specu- 
lative tasks across all processing sites of a distributed system, it may be realized 
as an attachment to the root task of the program, which for this purpose must 
be kept vital and running in some unique site. The root task maintains a data 
structure which contains pointers to the context blocks of all speculative tasks 
participating in the fairness regulation. This structure is updated upon receiv- 
ing control messages pertaining to the spawning of new speculative tasks and 
to tasks changing their status from speculative to vital, terminated or aborted. 
Information about barriers reached (and possibly blocked) or crossed by individ- 
ual tasks is held in the respective context blocks and updated whenever signals 
to this effect arrive. 

Another sophisticated mechanism concerns the creation of speculative tasks, 
their registration as new parties of the fairness mechanism, and actually
starting them. A vital or speculative task that executes, say, a CASE application may 
become a parent task that creates a speculative child task for every matching 
clause, generally in some other processing site, whereupon the parent usually
suspends itself since there may be little else to do but to wait for results returned 
by the children. The children, in turn, have to be registered with the fairness 
regulation mechanism and supplied with fuel to start running. The
communications necessary to do so are depicted in Fig. 2. After having created in sequence 
speculative tasks for all matching clauses (two in the example), the parent
continues until it receives acknowledgement messages from the new children, together 
with their identities. The children are then immediately suspended until
further notice. The parent registers the new children with the fairness regulation 
mechanism and suspends itself⁴. The fairness regulation, in turn, confirms the 
registration and sends out start signals, together with the fuel left by the parent 
to reach the next barrier. 

This rather complex scheme is necessary to catch asynchronous messages 
which may interleave with these communications. For instance, while the parent 
is in the process of spawning children or registering them with the fairness reg- 
ulation, an incoming message might signal the parent that the next barrier has 
been opened and more fuel is available, which may be immediately passed on to 
the children. 

⁴ If the parent is a speculative task itself, it temporarily resigns from the fairness 
regulation as well. 

Fig. 2. Spawning new speculative tasks 

Termination or abortion of speculative tasks depends to some extent on the 
type of the desired result. As mentioned in Section 2, this could be just any one of 
possibly several problem solutions, e.g., the one the system returns first, a unique 
single solution, say, the one that requires the least number of rewrite steps to 
arrive at, or all possible solutions requiring less than some pre-specified number 
of rewrite steps (the total fuel) to compute⁵. In both of the latter cases, the 
computations would produce determinate results irrespective of actual execution 
orders. In the first case, the result would be non-determinate, or dependent 
on actual task scheduling. 

A speculative task may be aborted if 

— by some criteria, typically a CASE application with no matching clauses, it 
produces the bottom symbol ⊥ which indicates failure of the particular clause 
to produce a problem solution (compare Section 2); 

— it receives a signal that a solution has been produced somewhere else and 
the task has thus become irrelevant. 

In either case the task immediately resigns from the fairness regulation, detaches 
itself from its parent, and releases its context block. 

A task terminates regularly if it produces a solution. All other tasks 

— may be aborted if just any one solution is required; 

— that have already consumed more rewrite steps are signaled to abort at the 
next barrier if the desired solution must have consumed the least number of 
rewrite steps. 

When computing all solutions up to the point of exhausting the total fuel, all 
tasks that are still alive at this point must be aborted. 

However, as results must be returned to the root task, the tasks that are on 
the path from a terminating task to the root must, of course, in all three cases 
stay alive until the results have been passed through. 

⁵ Without such an upper bound the computation could get trapped in endless
recursions and thus never terminate. 

4 Fairness Versus Limited Resources 

There is a fundamental conflict of interest between the demand for fair progress 
of all speculative tasks on the one hand and finite resources on the other hand. 

With no information at hand to set priorities, all opportunities for speculative 
evaluation ought to be treated on an equal basis, i.e., driven ahead at about the 
same pace, as otherwise too much processing power might be wasted on just the 
wrong computations. Given a recursively unfolding search problem, a breadth- 
first search would require spawning unbounded numbers of speculative tasks 
as there is generally no upper limit on the size of the search space. However, 
spawning tasks far beyond the number of processing sites would not only create 
considerable management overhead without any further performance gains but 
also consume decidedly more memory for runtime structures and heap space 
than, say, a strictly sequential search at the other extreme. 

To overcome this problem, a compromise must be made between unbounded 
demands for resources on the one hand and enforcing strict fairness among
speculative computations as outlined in the preceding section on the other hand. 
The idea is to provide a pool of concessions (or tokens), in a distributed system 
evenly spread out over its processing sites, to spawn new tasks [Klu83].
Whenever such an opportunity arises, the (local) pool is checked for the availability of 
a concession, and if so, the concession is taken and a new task is created; other- 
wise this opportunity is ignored and the sub-term that was to be evaluated by 
the new task is instead processed sequentially by the calling task. A terminating 
task recycles to the (same local) pool the concession it has taken, which may be 
immediately picked up to create another task. Thus, at any time there are at 
most as many tasks alive in the system as there were initially concession tokens 
in the pool. 
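The concession scheme can be sketched in a few lines. The following Python illustration makes several assumptions: there is a single pool rather than one per processing site, tasks are simulated one after the other from a ready queue instead of running concurrently, and the workload is a hypothetical traversal of a full binary tree in which every inner node offers one opportunity to spawn a task. The name `count_leaves` is introduced here for illustration only.

```python
from collections import deque
from typing import Deque, Tuple


def count_leaves(depth: int, tokens: int) -> Tuple[int, int]:
    """Traverse a full binary tree; return (leaves visited, peak live tasks)."""
    pool = tokens  # concessions currently available
    # Each entry: (depth of the subtree to process, task holds a concession).
    ready: Deque[Tuple[int, bool]] = deque([(depth, False)])
    live = peak = 1  # the root task exists without taking a concession
    leaves = 0

    def work(d: int) -> None:
        nonlocal pool, live, peak, leaves
        if d == 0:
            leaves += 1
            return
        if pool > 0:
            # A concession is available: take it and create a new task.
            pool -= 1
            live += 1
            peak = max(peak, live)
            ready.append((d - 1, True))
        else:
            # Opportunity ignored: the calling task evaluates sequentially.
            work(d - 1)
        work(d - 1)  # the other subtree always stays with the calling task

    while ready:
        d, holds_concession = ready.popleft()
        work(d)
        live -= 1  # the task terminates ...
        if holds_concession:
            pool += 1  # ... and recycles its concession to the pool
    return leaves, peak


print(count_leaves(depth=6, tokens=3))
```

Besides the root task, at most `tokens` tasks are alive at any time in this model, while the result of the traversal does not depend on the number of concessions.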

Ignoring opportunities to spawn new tasks in fact implies switching from a 
breadth-first to a depth-first execution mode. If this approach were applied 
to speculative tasks, fairness would clearly be violated: several matching clauses 
of a CASE application could only be evaluated in sequence, say from left to right, 
and depth-first, i.e., further CASE applications deeper down would recursively 
be treated in the same way. 

The way out of this dilemma consists in placing, on the way down, into a 
task-specific FIFO queue all matching clauses of a CASE application that for 
the time being must be left pending, as indicated in Fig. 3 for a recursive nesting 
of three CASEs that produce just two matches each. Whenever a task produces 
a bottom symbol ⊥ in the leftmost branch, indicating that the search for a 
solution has failed, e.g., at a CASE application with no matching pattern, it 
continues with the first alternative at the front end of the queue, which is the 
one highest up in the unfolding computational tree. Likewise, if concessions to 
spawn new tasks become available again later on, the queues of all existing tasks 
are inspected and the pending clauses highest up across all tasks residing in a 
particular site are chosen for the creation of new tasks. 

Since a task, after having produced a bottom symbol in its leftmost branch, 
is supposed to backtrack to the topmost alternative clause and this clause is to 
be taken from the front end of the queue, the queue cannot strictly be operated 
in FIFO order. As illustrated in Fig. 4, the pending clauses of the new branch 
must be sorted into the existing queue so that clauses of higher levels always 
precede clauses of lower levels of the computational tree to keep the computation 
breadth-first as much as possible. 

Fig. 4. Keeping pending clauses in the queue in descending order 
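The ordering requirement on the queue of pending clauses can be sketched with a priority queue. Note that a binary heap is swapped in here for the sorted queue described in the text; it yields the same order of removal. In this Python sketch the class `PendingClauses` and the clause labels are hypothetical, level 1 denotes the topmost CASE application, and clauses of the same level are taken in the order in which they were entered.

```python
import heapq
import itertools
from typing import Any, List, Optional, Tuple


class PendingClauses:
    """Pending clauses of one task, topmost level of the tree first."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, Any]] = []
        self._arrival = itertools.count()  # tie-breaker within a level

    def add(self, level: int, clause: Any) -> None:
        heapq.heappush(self._heap, (level, next(self._arrival), clause))

    def take_topmost(self) -> Optional[Tuple[int, Any]]:
        """Remove and return the pending clause highest up in the tree."""
        if not self._heap:
            return None
        level, _, clause = heapq.heappop(self._heap)
        return level, clause


queue = PendingClauses()
# On the way down the leftmost branch, one alternative per level is left pending.
for level in (1, 2, 3):
    queue.add(level, "alt@%d" % level)
first = queue.take_topmost()  # after a failure: backtrack to the topmost one
# Pursuing that alternative leaves new pending clauses at lower levels.
queue.add(2, "new@2")
queue.add(3, "new@3")
rest = [queue.take_topmost() for _ in range(4)]
print(first, rest)
```

The new clauses of levels 2 and 3 are merged behind the older clauses of the same level but ahead of everything further down, which is why plain FIFO order is not sufficient.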

This strategy does depth-first evaluation whenever, after exhaustion of all 
concessions to spawn speculative tasks, there is no alternative left, but gives 
preference to breadth-first evaluation as soon as concessions are (or become 
again) available, i.e., it is reasonably fair without involving elaborate priority 
schemes, given the constraints of limited numbers of tasks. 



5 Some Performance Experiments with a Search Problem 

The concepts described in the preceding sections have been implemented as
extensions of a concurrent graph reduction system π-RED [GK96, BHK94]. This 
system interprets abstract machine code to which high-level programs of the 
functional language KiR [Klu94] (whose syntax closely resembles the one given 
in Section 2) are compiled. Other than identifying S_CASE constructs as
candidates for speculative evaluation, these programs do not contain any annotations 
pertaining to concurrent execution. π-RED is installed on an nCUBE/2
distributed multiprocessor system of 32 processing sites which can be configured to 
form sub-cubes of some 2^k (k ≤ 5) sites. Each site runs a micro kernel nCX which 
supports a single π-RED process. This process in fact realizes a small kernel of 
its own which handles π-RED-specific task management and communication, 
including signals and messages from and to the fairness mechanism. 

Program code is downloaded from some front-end workstation into each 
nCUBE/2 processing site; program execution starts with a single task at a unique 
initial site (which also runs the fairness mechanism) from where it recursively 
spreads out over all sites of the particular configuration. The initial task in that 
site eventually also assembles the result of the computation and returns it to the 
front-end. As of now, the system interprets compiled abstract machine code as 
it was easier at this level to implement, modify and experiment with the pieces 
of code that control the creation, synchronization and termination of
speculative tasks, communicate with the fairness regulation mechanism, effect status 
changes, etc. For programs which heavily use pattern matching, code
interpretation is about three times slower than compiled target machine code. 

To systematically investigate performance enhancements due to speculative 
evaluation relative to sequential program execution, we have chosen a program 
which searches for exits in a maze. Though this problem can be elegantly and 
efficiently solved by encoding the maze into a set of higher-order functions, of 
which an application of the function that represents the starting point produces 
the desired result, we have deliberately chosen a decidedly more primitive al- 
gorithm. It has the advantage of creating by relatively simple means and in a 
controlled way sufficient computational complexity and opportunities for spec- 
ulative computations. 

The search is simulated by a spider walking step by step through the maze, 
represented as a matrix which marks walls as 1s and the aisles in between as 
0s, and checking at each position, by means of pattern matches, how many 
alternatives are available to make a next step. If there is more than one direction 
to follow, the computation splits up into as many speculative searches. The 
search along a particular path succeeds if an exit can be reached and fails if, at a 
particular position, there is no alternative left but to move backward. Each search 
path is kept track of by generating a list (tuple) of coordinate pairs (positions) 
where the spider changed directions or branched out into two or three different 
directions. The paths that led to exits are returned as output and those that led 
into dead ends are discarded. The program can be parameterized with respect to 
size and layout (complexity) of the maze, number and positions of exits, starting 
position and width of the steps taken by the spider. 

The program basically centers around a function explore as shown in Fig. 5 
which, given a certain position of the spider in the maze in terms of its
coordinates x, y, moves one step ahead in the direction specified by the parameter h 
(which may assume one of the values 'east', 'west', 'north', 'south') and finds 
out where to move next from there. 

    explore x y h =
      LET xx = (new_x x h),
          yy = (new_y y h)
      IN IF (is_exit xx yy)
         THEN << xx, yy >>
         ELSE LET hs = (scan xx yy h)
              IN ...
                 ( S_CASE
                   < _ * 'east' * _ >  -> (path < xx, yy > (explore xx yy 'east'))
                   < _ * 'west' * _ >  -> (path < xx, yy > (explore xx yy 'west'))
                   < _ * 'north' * _ > -> (path < xx, yy > (explore xx yy 'north'))
                   < _ * 'south' * _ > -> (path < xx, yy > (explore xx yy 'south'))
                   END_CASE hs )

Fig. 5. Code for the function explore that controls the movement of the spider 

From top to bottom, this function computes, by means of the functions new_x 
and new_y, the new pair of coordinates xx, yy, checks whether this new
position is an exit (in which case the function terminates by returning the pair of 
coordinates), and, if not, computes by means of the function scan the
directions in which the spider may move next, and collects these directions in a tuple 
returned as function value which instantiates the LET-bound variable hs. This 
tuple is then taken as an argument of an S_CASE function whose patterns⁶ check 
it for occurrences of the four possible directions. In case of a match (of which 
there may be up to three since the spider is not allowed to move backward), 
the coordinate pair < xx, yy > is appended to the list(s) of travel positions 
computed so far and prepended to the list of positions still to be computed by 
the recursive application of the function explore to these coordinates and to the 
direction identified by the pattern match. 
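For readers who want to experiment, the behaviour of explore can be approximated by the following sequential Python sketch. It is not a translation of the program used in the experiments. The maze, the set of exits and the `fuel` parameter are assumptions made for this example, every visited position is recorded (not only the positions where the spider turns or branches), and the alternatives are pursued one after the other rather than speculatively.

```python
from typing import Dict, List, Set, Tuple

Pos = Tuple[int, int]  # (row, column)
MOVES: Dict[str, Pos] = {
    "east": (0, 1), "west": (0, -1), "north": (-1, 0), "south": (1, 0),
}
BACK = {"east": "west", "west": "east", "north": "south", "south": "north"}


def scan(maze: List[List[int]], pos: Pos, h: str) -> List[str]:
    """Directions open from pos, excluding the move backward."""
    dirs = []
    for d, (dr, dc) in MOVES.items():
        r, c = pos[0] + dr, pos[1] + dc
        inside = 0 <= r < len(maze) and 0 <= c < len(maze[0])
        if d != BACK[h] and inside and maze[r][c] == 0:
            dirs.append(d)
    return dirs


def explore(maze: List[List[int]], exits: Set[Pos], pos: Pos, h: str,
            fuel: int) -> List[List[Pos]]:
    """Step from pos in direction h; return all exit paths found within fuel."""
    if fuel == 0:
        return []  # count exhausted: this branch is given up
    new = (pos[0] + MOVES[h][0], pos[1] + MOVES[h][1])
    if new in exits:
        return [[new]]
    paths = []
    for d in scan(maze, new, h):  # every matching direction is pursued
        for tail in explore(maze, exits, new, d, fuel - 1):
            paths.append([new] + tail)
    return paths


maze = [  # 1 marks a wall, 0 an aisle; the entry is at (0, 1)
    [1, 0, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 0, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 0, 1],
]
exits = {(1, 4), (4, 3)}
for path in explore(maze, exits, (0, 1), "south", fuel=20):
    print(path)
```

The `fuel` argument plays the role of the count variable of Section 2: with a small value, only the exit that is close to the entry is found.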

⁶ The patterns are specified by means of wild cards since the particular direction may 
be in any of at most three tuple positions. 

The first problem investigated was the search for a single exit in a maze 
featuring the regular shape of a balanced binary tree of 128 leaves at the base 
(the left of Fig. 6 shows a tree of eight leaves). The search starts at the root 
of the tree, and terminates successfully at a leaf which is selected as the single 
exit. This exit position is moved from left to right along the leaves. The program 
execution times on the right of this figure confirm what can be expected: when 
doing a depth-first and left-to-right sequential search on a single processing site, 
the runtimes increase linearly from the exit being in the leftmost to being in the 
rightmost position. With speculative searches, runtimes drop considerably with 
increasing numbers of processing sites and become invariant against changing 
exit positions as the tree is well balanced. The speculative search on 16 processing 
sites clearly outperforms the sequential search for all but the 8 leftmost of the 128 
possible exit positions. Thus, at the expense of involving considerable processing 
capacity, one may win in some (most) situations but lose in others. 

Fig. 6. Searching for an exit in a maze shaped as a binary tree 

Similar experiments have been conducted with several square-shaped mazes 
of highly irregular internal structures of aisles and walls, with exits placed on each 
of the four borders, and with different lengths of the search paths leading to these 
exits. A typical such maze is shown on the left of Fig. 7. The searches were done 
with only one of the four or with all exits open, and with the paths to one or all 
four exits as results. Though performance figures to some extent depend on the 
particularities of the maze structure and also on the machine configuration (e.g., 
on the chosen synchronic distance of the fairness mechanism or on upper bounds 
on the number of speculative tasks), it again turned out that for all mazes and 
system parameters investigated the speculative searches on 16 processing sites 
did decidedly better than all but the best case sequential searches, and that with 
8 or 4 processing sites the speculative searches outperformed at least the worst 
case sequential searches. Representative of these results are the execution times 
on the right of Fig. 7. 

Fig. 7. Searching for exits in an arbitrary maze 

An interesting question concerns the overhead inflicted by the organizational 
measures necessary to support speculative evaluation as described. This ques- 
tion is difficult to answer quantitatively since it would require appropriate in- 
strumentation of the runtime system and profiling of several application pro- 
grams with different system configurations (numbers of processing sites, number 
of concessions to spawn speculative tasks, synchronic distances of the fairness 
mechanism, etc.), which we simply did not have the resources to do. Comparing 
wallclock execution times with speculative evaluation activated and de-activated 
would not be very conclusive as they relate to different modes of program ex- 
ecution with completely different dynamic program behavior. However, looking 
at the performance diagrams of Figs. 6 and 7, which compare speculative
evaluation of the same program on different numbers of processing sites, it can at 
least be concluded that the overhead must be quite substantial (i.e., in the order 
of 30% to 50%): doubling the number of processing sites (from 4 to 8 and then 
to 16) reduces program execution times merely to about 2/3 rather than to 1/2. 
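One way to read this estimate is the following rough calculation, which is not part of the measurements. Write T(p) for the execution time on p processing sites (a symbol introduced here for illustration only) and assume that ideal scaling would halve T with every doubling of p.

\[
\frac{T(2p)}{T(p)} \approx \frac{2}{3}
\quad\Longrightarrow\quad
\frac{T(p)}{T(2p)} \approx \frac{3}{2}
\quad\text{instead of } 2,
\qquad
\frac{3/2}{2} = 0.75 ,
\]
\[
\frac{T(4)}{T(16)} \approx \left(\frac{3}{2}\right)^{2} = 2.25
\quad\text{instead of } 4,
\qquad
\frac{2.25}{4} \approx 0.56 .
\]

Under this assumption each doubling loses about 25% against ideal scaling and the two doublings together lose about 44%, which is of the same order as the 30% to 50% quoted above.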

Conclusive wallclock measurements were only possible with respect to the 
effect on program execution times of varying synchronic distances enforced by 
the fairness regulation mechanism. Representative are the data collected from a 
maze program that takes roughly 350,000 rule applications to find a single exit. 
It was run on a system configuration with 8 processing sites and unbounded 
numbers of speculative tasks. Changing the synchronic distance from 5000 to 
50,000 rule applications (in increments of 5000) led to a nearly linear reduction 
of control messages from some 260 to 24 (i.e., by about an order of magnitude) 
and of the program runtime by about 10%. 



6 Related Work 



In the functional domain, speculative evaluation primarily relates to non-strict 
arguments under a lazy regime, with Haskell as the language of choice [Par91,
Mat93]. [Che94] describes several priority schemes for speculative tasks, 
up- and down-grading, and both explicit and implicit abortion of tasks that have 
become irrelevant. This work also addresses the problem of non-determinacy and 
suggests introducing the notion of bags in which solutions are collected without 
any ordering to maintain a functional semantics. 

Another interesting approach based on the speculative evaluation of Lisp
programs is reported in [Osb90]. It introduces a rather elaborate priority scheme
based on a so-called sponsor model which favors the execution of speculative 
tasks with the best chances of succeeding. Sponsorships and, hence, priorities 
are dynamically adjusted as computations proceed and more information about
possible outcomes becomes available. The sponsor model implicitly also gets
around the fairness vs. limited resources problem.

In logic programming, the concepts closest to our approach relate to the
exploitation of the so-called OR-parallelism in PROLOG programs [Sha89], for
which scheduling strategies are described in [Bea91]. Interestingly enough,
speculative evaluation in PROLOG relates to computations that may be cut off by
pruning operations such as the famous cut. Detecting this kind of speculative
work is discussed in [Hau89].

Representative for related work on term-rewrite systems is [VK96]. It describes
an interactive theorem prover called Larch which allows the user to
launch several speculative attempts to prove conjectures; unsuccessful attempts
must be cleaned up by hand. Both AND- and OR-parallelism are supported,
and the number of competing tasks or the depth of the recursive expansion can 
be restricted. 



7 Conclusion 

The system concept described in this paper merely supports the measures that 
are considered absolutely essential to organize speculative computations in some 
orderly form. They mainly concern a reasonably efficient usage of resources (by 
giving mandatory tasks scheduling priority over speculative tasks and by throt- 
tling the number of speculative tasks that may participate in a computation), 
ensuring fair progress among all speculative tasks, and earliest possible sta- 
tus changes of speculative tasks either to aborted or to vital. The emphasis of 
this work was primarily on feasibility studies of various alternative solutions to 
these problems, less so on absolute performance figures (hence interpretation of 
abstract machine code which facilitated experimental implementations consid- 
erably). No attempts have been made to speculate, e.g., by means of more or 
less elaborate priority schemes, on the likely behavior of the program at hand, 
and to have task scheduling governed by these priorities, as for instance in the 
approaches reported in [Che94] and in [Osb90]. Concentrating on the essentials
has nevertheless yielded encouraging results in terms of noticeable performance 
gains vs. sequential execution. 

Further work primarily concerns more refined scheduling strategies for specu- 
lative tasks which take into account several dynamically changing priority levels, 
improved mechanisms for passing control messages up and down hierarchies of 
speculative tasks, and compilation to target machine code.

Abstract. Sac is a functional array processing language particularly
designed with numerical applications in mind. In this field the runtime 
performance of programs critically depends on the efficient utilization of 
the memory hierarchy. Cache conflicts due to limited set associativity are 
one relevant source of inefficiency. This paper describes the realization of 
an optimization technique which aims at eliminating cache conflicts by 
adjusting the data layout of arrays to specific access patterns and cache 
configurations. Its effect on cache utilization and runtime performance is 
demonstrated by investigations on the PDE1 benchmark.



1 Introduction 

Sac is a functional array processing language, which tries to combine generic, 
high-level program specifications with efficient runtime behaviour |20l^ . Par- 
ticularly in the field of numerical applications, the efficient utilization of the 
memory hierarchy plays a key role in achieving good performance m- However, 
for many numerical application programs it can be observed that small variations 
in problem sizes may have a significant impact on runtime performance. This 
is due to systematic cache conflicts which may occur for unfavourable combina- 
tions of array access patterns and array data layout in the presence of limited 
cache associativity [2].

Assuming the runtime performance of a program is poor for one problem 
size, but turns out to be significantly better for a marginally larger problem size, 
it is a rather straightforward idea to mimic the data layout associated with
the larger problem size when actually dealing with the smaller one. In doing 
so, the originally dense representation of arrays is manipulated by the introduc- 
tion of dummy elements in one or another dimension, so-called array padding 
|T]. The array padding optimization implemented in Sac basically consists of 
three steps. First, Sac code within with-loops, the predominant Sac language
construct for the specification of aggregate array operations |^, is thoroughly 
analysed for array accesses, and the arrays involved are associated with accurate 
access patterns. Second, an inference heuristic estimates the cache utilization and 
identifies an appropriate amount of padding where necessary. Cache phenomena
such as spatial and temporal reuse are taken into account. Third, the data layout
modification proposed by the inference heuristic is realized as a high-level
transformation on intermediate Sac code.

The remainder of this paper is organized as follows. After a more detailed
problem identification in Section 2, Sections 3, 4 and 5 describe the three steps
of the implementation. Their effect on runtime performance is demonstrated by
means of the PDE1 benchmark in Section 6. Section 7 sketches some related
work while Section 8 concludes.



2 Problem Identification 

We have chosen the benchmark PDE1 as an example in order to investigate
and quantify the potential impact of the problem size on runtime performance.
PDE1 implements red/black successive over-relaxation on 3-dimensional grids.
The benchmark itself as well as various implementation opportunities for Sac 
are discussed in |S]. In our experiments we have systematically varied the size 
of the 3-dimensional grid from 16^3 to 528^3 in uniform steps of 16 elements
in each dimension. With double precision floating point numbers, this involves 
array sizes between 32KB and 1.1GB. All experiments have been done on a SUN 
Ultra Enterprise 4000 system. Figure 1 shows the average times required to
recompute the value of a single inner grid element. It can be observed that these
times vary significantly for the problem sizes investigated. While 155nsec are
sufficient to update an inner element of a grid of size 16^3, it takes up to 866nsec
to complete the same operation in a grid of size 256^3. Although exactly the same
sequence of instructions is executed for each inner grid element regardless of the 
problem size, the time required to do so varies by a factor of 5.6. 
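The figures quoted in this paragraph can be checked with a few lines of arithmetic. The following Python sketch assumes 8-byte double precision elements and binary units (1KB = 1024 bytes); the helper name `grid_bytes` is ours and not part of the benchmark.

```python
# Sanity check of the problem sizes quoted in the text (illustrative only).
ELEMENT_BYTES = 8  # assumption: one double precision number occupies 8 bytes


def grid_bytes(n: int) -> int:
    """Memory footprint in bytes of a dense n x n x n grid of doubles."""
    return n ** 3 * ELEMENT_BYTES


sizes = list(range(16, 528 + 1, 16))             # 16, 32, ..., 528
smallest_kb = grid_bytes(sizes[0]) / 1024        # footprint of the 16^3 grid
largest_gb = grid_bytes(sizes[-1]) / 1024 ** 3   # footprint of the 528^3 grid
slowdown = 866 / 155                             # ratio of per-element times

print(len(sizes), smallest_kb, round(largest_gb, 2), round(slowdown, 1))
```

Under these assumptions the sketch reproduces the 32KB and 1.1GB array sizes as well as the factor of 5.6 between the fastest and the slowest per-element update time.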

Such extreme variations in runtime performance can only be attributed to 
different degrees of cache utilization caused by varying data layouts introduced 
by different problem sizes. In order to substantiate claims like this, the Sac 
compiler and runtime system are equipped with a tailor-made cache simulation 
feature. On demand, a trace of all array accesses during program execution is 
generated. This allows for a complete simulation of the cache behaviour, yielding 
statistical information regarding the effectiveness of cache utilization. Each
processor of the SUN Ultra Enterprise 4000 multiprocessor system is equipped with
a 16KB L1 data cache and a 1MB L2 unified cache. Both are direct-mapped and
use cache lines of 32 and 64 bytes, respectively. Figure 2 shows the percentage of
L1 cache hits for the various problem sizes investigated as well as the percentage
of memory requests satisfied by either of the two cache levels. It actually turns out
that the extreme performance variations observed in Fig. 1 largely coincide with
similar variations in simulated cache hit rates.

The design of cache memories is essentially based on two assumptions: tem- 
poral locality and spatial locality 0 . A program exhibits temporal locality if it is 
likely that once a memory address is referenced in the code, it will be referenced 
again soon. Therefore, data is loaded into the fast cache memory in order to
satisfy subsequent requests without slow main memory interaction. Spatial locality
means that once a memory address is referenced, adjacent addresses are likely to
be referenced soon. For this reason, caches are internally organized in so-called 
cache lines, which typically comprise between 16 and 128 bytes of contiguous 
memory. All data transfers between main memory and cache involve entire cache 
lines rather than single bytes or words of memory. Application programs benefit
from caches only to the extent to which they exhibit spatial and temporal
locality.

However, spatial and temporal locality are mainly characteristics of a given 
program, and hence, do not explain the observed performance variations. In 
fact, it is a limitation in cache memory hardware that is responsible for this: 
very limited set associativity. In order to efficiently distinguish cache hits from 
cache misses, any given memory address can only be mapped to one of very 
few locations in the cache, which are directly derived from the memory address 
itself. Today’s caches usually provide set associativities between one and four. 
As a consequence, data may be flushed from the cache before potential reuse is 
actually exploited, although the cache is sufficiently large to allow the reuse in 
principle. These so-called conflict misses may seriously limit cache utilization,
as can be seen in Figs. 1 and 2. Since concrete memory addresses determine
whether cache conflicts occur, such conflicts are extremely sensitive to memory
layout variations, in particular whenever regularly structured data is accessed in
regular patterns, which is typical for numerical codes involving large arrays.
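The sensitivity to the memory layout can be illustrated with a small model of cache set selection. The Python sketch below uses the L1 parameters given above (16KB, 32-byte lines, direct-mapped) as defaults; it assumes, purely for illustration, that the grid starts at byte address 0 and is stored densely in row-major order. The function names are ours.

```python
# Illustrative model of set selection in a set-associative cache.
# Assumptions (not from the paper): the array starts at byte address 0 and
# is stored densely in row-major order.

def cache_set(addr: int, cache_bytes: int = 16 * 1024,
              line_bytes: int = 32, assoc: int = 1) -> int:
    """Index of the cache set that the byte address addr is mapped to."""
    n_sets = cache_bytes // (line_bytes * assoc)
    return (addr // line_bytes) % n_sets


def plane_neighbour_sets(n: int, elem_bytes: int = 8) -> tuple:
    """Cache sets of elements [0,0,0] and [1,0,0] of a dense n x n x n grid."""
    plane_bytes = n * n * elem_bytes  # distance between the two elements
    return cache_set(0), cache_set(plane_bytes)


print(plane_neighbour_sets(256))  # both elements compete for the same set
print(plane_neighbour_sets(272))  # the elements fall into different sets
```

In this model, neighbouring elements along the outermost dimension of a 256^3 grid are mapped to the same cache set, whereas a marginally larger 272^3 grid separates them.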

Various different cache effects have been identified [22]. For example, a spatial reuse
conflict occurs whenever not all array elements referenced in a single iteration of 
an inner loop can simultaneously be held in the cache. The number of different 
array elements which are mapped to the same cache set exceeds the cache’s set 
associativity and, hence, cache lines are flushed from the cache before potential 
reuse can be realized in the following iteration. A temporal reuse conflict occurs 
when potential reuse between two references to the same array element cannot 
be exploited because another array reference interferes and causes the first one 
to be flushed from the cache before the potential reuse actually occurs. Conflicts 
are classified as either arising from references to the same array, so-called self- 
interference conflicts, or to different arrays, so-called cross-interference conflicts. 

Thorough elimination of cache conflicts is crucial for keeping the runtime 
performance consistent over a range of problem sizes m- This can be achieved 
by a well-aimed manipulation of the data layout of arrays. Self-interference con- 
flicts can be eliminated by modifying the internal representation of arrays, cross- 
interference conflicts by adjusting array base addresses. The latter approach is 
very difficult to realize in a language like Sac, which allocates and de-allocates 
all data structures dynamically. Therefore, we concentrate on self-interference 
conflicts in the following. One way to manipulate the internal representation of 
arrays is array padding, a well-known optimization technique that adds dummy 
elements to an array in one or another inner dimension [T]. For example, an 
array whose original shape is [100,100] may be transformed into an array of 
shape [100, 102] by adding two columns of dummy elements. Padding an array
alters the memory addresses of different elements in different ways and, hence,
makes it possible to manipulate their associated relative cache locations indirectly.
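The [100,100] to [100,102] example can be made concrete on a flat, row-major representation. The following Python sketch is only an illustration; `offset` and `pad2d` are hypothetical helper names and do not reflect how the Sac runtime represents arrays.

```python
# Sketch of 2-dimensional array padding on a flat, row-major representation.
# All names are illustrative and not taken from the Sac implementation.

def offset(index, shape):
    """Row-major offset of index within an array of the given shape."""
    off = 0
    for idx, extent in zip(index, shape):
        off = off * extent + idx
    return off


def pad2d(flat, shape, padded_shape, dummy=0):
    """Copy a flat rows x cols array into a layout with extra dummy columns."""
    rows, cols = shape
    padded = [dummy] * (padded_shape[0] * padded_shape[1])
    for i in range(rows):
        for j in range(cols):
            padded[offset((i, j), padded_shape)] = flat[offset((i, j), shape)]
    return padded


shape, padded_shape = (100, 100), (100, 102)
data = list(range(100 * 100))
padded = pad2d(data, shape, padded_shape)

# Element [i,j] moves by 2*i positions, so rows are shifted by different amounts.
shifts = [offset((i, 0), padded_shape) - offset((i, 0), shape) for i in (0, 1, 50)]
print(shifts)
```

Since each row is displaced by a different amount, the distance in memory between elements of different rows changes, and with it their relative cache locations, while the logical contents of the array stay the same.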

However, applying array padding manually has some serious drawbacks. It
requires both a lot of effort and expert knowledge from programmers, who in this
case are solely responsible for identifying where and how much padding might
have a positive impact on runtime performance. Moreover, explicit array padding
increases program complexity and makes programs less readable and more
error-prone. Last but not least, array padding renders program specifications
machine-dependent because each combination of problem size, access pattern, and
cache configuration typically requires a different amount of padding.

In contrast, array padding as a compiler optimization may be well-suited 
to achieve more consistent performance over a wide range of problem sizes and 
cache configurations. However, things are not as simple in low-level languages 
such as C or Fortran. Since these languages’ semantics guarantee a certain 
(unpadded) data layout, thorough program analysis is required in order to prove 
that padding does not alter the meaning of a program. Here, the design of 
high-level languages like Sac pays off. Since they completely abstract from any 
concrete data layout, language implementations are free to exploit the benefits 
of varying data layouts as an additional optimization technique. 



3 Array Access Analysis 

Accurate analysis of array access patterns is one of the prerequisites for reasoning 
about cache conflicts. Severe cache conflicts typically arise from regular array 
references within loops, i.e., two or more references systematically conflict with 
each other in every iteration of the loop. Therefore, the analysis described in 
this section focusses on regular array references in with-loops. The with-loop
is a Sac-specific language construct for the specification of aggregate
multi-dimensional array operations; a thorough description may, for instance, be
found in [7]. An array reference is considered regular if and only if it can be
written in the form



    val = Array[ s * i + d ];

where s denotes a constant stride vector, d a constant offset vector, and i the
with-loop’s index variable. Note that * here denotes the element-wise product of
two vectors. In other words, locations of regular array references are defined by
dimension-wise affine functions of the with-loop’s index variable. Figure 3 shows
an example with-loop featuring a few different regular array references. All
array references that cannot be converted to this affine pattern are considered
irregular. They are unlikely to conflict in a systematic way with other references,
irregular or regular. Therefore, they are simply ignored in the sequel.

All array references in the example shown in Fig. 3 are regular with respect
to the above definition. This can be inferred during a rather simple bottom-up
traversal of the with-loop body. Compact array access information is accumulated,
as outlined in Fig. 4. The array access pattern AP is a set of triples;






Clemens Grelck 



int[100,100] A;
int[200,150] B;
int[120,120] C;

A = with ([1,1] <= iv < [100,100])
    {
      a   = B[iv - 1];
      b   = C[iv];
      c   = B[iv + 2];
      d   = C[[42,42]];
      e   = B[[2,1] * iv];
      tmp = iv + [1,1];
      f   = B[[2,1] * tmp];
      val = a + b + c + d + e + f;
    }
    genarray([100,100], val);



Fig. 3. Examples of regular array references in a with-loop

Fig. 4. Array access pattern derived from example with-loop in Fig. 3



each triple represents exactly one regular array reference found in the with- 
loop body. The access triples themselves consist of the name of the referenced 
array, the stride vector s and the offset vector d. 

As already pointed out, the technique presented in this paper focusses on self- 
interference cache conflicts, i.e. conflicts between references to the same array. 
References to different arrays, although occurring in a single with-loop, may be
handled separately. Furthermore, only array references which are characterized 
by identical stride vectors s may actually interfere with each other in a systematic 
and, hence, expensive manner. These considerations lead to the division of an 
access pattern into disjoint so-called conflict groups. Each conflict group then 
contains exactly one subset of array references which are likely to systematically 
interfere with each other. 

The example access pattern AP in Fig. 4 results in the introduction of four
conflict groups, as outlined in Fig. 5. Each conflict group is represented by a pair
consisting of the type of the referenced array and a sequence of offset vectors. The
stride vectors are no longer needed. Whether or not two references of the same
conflict group cause a cache conflict solely depends on their relative distance in
memory, which is invariant against their strides. Last but not least, no cache

CG1 = < int[200,150] , < [-1,-1], [ 2, 2] > >
CG2 = < int[120,120] , < [ 0, 0] > >
CG3 = < int[120,120] , < [42,42] > >
CG4 = < int[200,150] , < [ 0, 0], [ 2, 1] > >

Fig. 5. Conflict groups derived from access pattern AP in Fig. 4



conflicts may occur in conflict groups consisting of a single array reference only.
As a consequence, all such conflict groups, e.g. CG2 and CG3 in Fig. 5, are simply
ignored. The number of conflict groups can further be reduced by the elimination
of multiple occurrences of identical ones and of those that are subsets of others.
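The derivation of conflict groups can be sketched in a few lines of Python. The access triples below were read off the with-loop of Fig. 3 by hand, treating the constant index [42,42] as a reference with stride [0,0]; this encoding, like all function and variable names, is our own assumption rather than the data structure used by the Sac compiler.

```python
# Hand-derived access triples (array, stride, offset) for the loop in Fig. 3.
ARRAY_TYPES = {"B": "int[200,150]", "C": "int[120,120]"}

ACCESS_PATTERN = [
    ("B", (1, 1), (-1, -1)),  # a = B[iv - 1]
    ("C", (1, 1), (0, 0)),    # b = C[iv]
    ("B", (1, 1), (2, 2)),    # c = B[iv + 2]
    ("C", (0, 0), (42, 42)),  # d = C[[42,42]]
    ("B", (2, 1), (0, 0)),    # e = B[[2,1] * iv]
    ("B", (2, 1), (2, 1)),    # f = B[[2,1] * (iv + [1,1])]
]


def conflict_groups(access_pattern, array_types):
    """Group offsets of references to the same array with identical strides."""
    groups = {}  # dicts keep insertion order, so groups appear as first seen
    for array, stride, offset in access_pattern:
        groups.setdefault((array, stride), []).append(offset)
    return [(array_types[a], offs) for (a, _), offs in groups.items()]


def prune(groups):
    """Drop singleton groups, duplicates, and groups that are subsets of others."""
    multi = [(t, offs) for t, offs in groups if len(offs) > 1]
    kept = []
    for i, (t, offs) in enumerate(multi):
        dominated = any(
            j != i and t == t2 and set(offs) <= set(offs2)
            and (set(offs) < set(offs2) or j < i)  # keep first of identical groups
            for j, (t2, offs2) in enumerate(multi)
        )
        if not dominated:
            kept.append((t, offs))
    return kept


groups = conflict_groups(ACCESS_PATTERN, ARRAY_TYPES)
relevant = prune(groups)
print(groups)
print(relevant)
```

For the example, the four groups of Fig. 5 are obtained, and only the two groups referring to int[200,150] remain after pruning.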

4 Padding Inference Heuristic 

This section presents the central padding inference algorithm. It associates each 
array type occurring in a Sac program or module with a padding recommenda- 
tion appropriate for avoiding spatial and temporal self-interference cache con- 
flicts. The basic idea is to pad all arrays of a given type (consisting of base type 
and shape) in a uniform way if at all. This helps to avoid costly transforma- 
tions between unpadded and padded or even differently padded representations 
of arrays which originally had identical types and, hence, data layouts. Such 
transformations are limited to module boundaries, providing programmers with 
some means of control over array padding. 

In addition to the conflict groups implicitly derived from Sac code, as described
in Section 3, the inference scheme presented here is based on the specification
of a cache configuration, which must explicitly be stated at compile time.
It consists of the cache size and the cache line size, both in bytes, as well as the 
cache’s set associativity. Furthermore, an upper limit must be set on memory 
consumption overhead caused by array padding. 

When focussing on a single array type, which consists of a scalar base type 
and an original shape SHP, we may easily compute the cache size CS and the 
cache line size CLS in array elements. These figures, rather than the external 
specifications in bytes, are used by the inference scheme. Moreover, we compute
the number of cache sets, NSET := CS/(CLS * CA), where CA denotes
the cache’s set associativity. With this internal cache specification at hand, all
conflict groups associated with the array type under consideration are then suc- 
cessively analysed with respect to potential cache conflicts. Padding recommen- 
dations are accumulated in a vector PAD, which is initially set to 0, i.e., we 
start out with recommending no padding at all. 
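The conversion from the external cache specification in bytes to the internal one in array elements may be sketched as follows. The Python names are hypothetical; the example values are the L1 and L2 parameters quoted in Section 2 combined with 8-byte double precision elements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheSpec:
    size_elems: int   # CS: cache size in array elements
    line_elems: int   # CLS: cache line size in array elements
    assoc: int        # CA: set associativity
    n_sets: int       # NSET := CS / (CLS * CA)


def internal_cache_spec(cache_bytes, line_bytes, assoc, elem_bytes):
    """Derive the element-based cache specification for one array base type."""
    cs = cache_bytes // elem_bytes
    cls = line_bytes // elem_bytes
    return CacheSpec(cs, cls, assoc, cs // (cls * assoc))


l1 = internal_cache_spec(16 * 1024, 32, 1, 8)     # 16KB, 32-byte lines
l2 = internal_cache_spec(1024 * 1024, 64, 1, 8)   # 1MB, 64-byte lines
pad = [0, 0, 0]  # initial padding recommendation for a 3-dimensional array
print(l1, l2, pad)
```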

First, spatial reuse conflicts are addressed. Let us consider a conflict group
CG representing array references R_1, ..., R_n. For each reference R_i, the offset
vector D_i is converted into a scalar offset with respect to the array shape SHP
extended by the padding vector PAD recommended so far:



$$\forall i \in \{1, \ldots, n\} : \mathit{OFFSET}_i := \mathit{ADDR}(D_i, \mathit{SHP} + \mathit{PAD})$$

where ADDR(vec, shp) is a function that computes the offset of vec in the
row-major unrolling of an array with shape shp, i.e.

$$\mathit{ADDR}(\mathit{vec}, \mathit{shp}) := \sum_{k=0}^{|\mathit{shp}|-1} \Bigl( \mathit{vec}_k * \prod_{m=k+1}^{|\mathit{shp}|-1} \mathit{shp}_m \Bigr)$$

For reasons of simplicity it is desirable to avoid negative offsets. Since our interest 
is also limited to relative distances of cache locations, computed offsets can easily 
be shifted by a constant value. The easiest way to avoid negative offsets is to 
generally arrange the elements of a conflict group in ascending lexicographical 
order with respect to their offset vectors, and to subtract OFFSET_1 from each
scalar offset, i.e.

$$\forall i \in \{1, \ldots, n\} : \mathit{OFFSET}_i := \mathit{OFFSET}_i - \mathit{OFFSET}_1 .$$
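The offset computation and the subsequent shift translate directly into code. The following Python sketch uses our own function names; it applies the row-major address function to the padded shape and then subtracts the offset of the lexicographically smallest offset vector.

```python
from math import prod


def addr(vec, shp):
    """Offset of index vector vec in the row-major unrolling of shape shp."""
    return sum(v * prod(shp[k + 1:]) for k, v in enumerate(vec))


def shifted_offsets(offset_vectors, shp, pad):
    """Scalar offsets of a conflict group, shifted so that the first one is 0."""
    padded = [s + p for s, p in zip(shp, pad)]
    ordered = sorted(offset_vectors)            # ascending lexicographic order
    offsets = [addr(d, padded) for d in ordered]
    return [o - offsets[0] for o in offsets]


# Conflict group CG1 of Fig. 5: offsets [-1,-1] and [2,2] on shape [200,150].
print(shifted_offsets([(2, 2), (-1, -1)], (200, 150), (0, 0)))
print(shifted_offsets([(2, 2), (-1, -1)], (200, 150), (0, 2)))
```

Padding the innermost dimension by two elements changes the relative distance between the two references of the group, which is the quantity the heuristic manipulates.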

With the shifted offsets at hand, we now determine the respective cache sets 

$$\forall i \in \{1, \ldots, n\} : \mathit{SET}_i := (\mathit{OFFSET}_i / \mathit{CLS}) \bmod \mathit{NSET} .$$

For each reference R_i, we compute the number NPSC_i of potential spatial reuse
conflicts with other references. Two references R_i and R_j potentially conflict with
each other if and only if

$$\bigl( (|\mathit{SET}_i - \mathit{SET}_j| < 2) \lor (|\mathit{SET}_i - \mathit{SET}_j| = \mathit{NSET} - 1) \bigr) \land (|\mathit{OFFSET}_i - \mathit{OFFSET}_j| > 2 * \mathit{CLS}) ,$$

i.e., they reference non-adjacent memory addresses which are mapped to identical
or directly adjacent cache sets. The latter serves as an additional buffer that
makes it possible to abstract completely from relative placements of references
within cache lines. In a direct-mapped cache (CA = 1), any potential conflict
actually is a real conflict. However, in general, a conflict occurs whenever the
number of potential conflicts equals or exceeds the cache’s set associativity CA,
i.e., the number of spatial reuse conflicts associated with each array reference is
defined as

$$\forall i \in \{1, \ldots, n\} : \mathit{NSC}_i := \max(0, \mathit{NPSC}_i - \mathit{CA} + 1) ;$$

the total number of spatial reuse conflicts within the conflict group is defined as

$$\mathit{NSC} := \sum_{i=1}^{n} \mathit{NSC}_i .$$
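The counting of spatial reuse conflicts can be sketched as follows. This is a direct Python transcription of the definitions of SET_i, NPSC_i, NSC_i and NSC under the assumption that the division by CLS is an integer division; the function names are ours. The example offsets correspond to three references that are one plane apart in a grid with a 256 x 256 plane and in one with a 256 x 257 plane, respectively.

```python
def cache_sets(offsets, cls, nset):
    """SET_i := (OFFSET_i / CLS) mod NSET, using integer division."""
    return [(off // cls) % nset for off in offsets]


def potentially_conflicting(i, j, offsets, sets, cls, nset):
    """True if references i and j map to identical or adjacent cache sets
    although their memory addresses are not adjacent."""
    dist = abs(sets[i] - sets[j])
    near_sets = dist < 2 or dist == nset - 1
    far_apart = abs(offsets[i] - offsets[j]) > 2 * cls
    return near_sets and far_apart


def spatial_reuse_conflicts(offsets, cls, nset, ca):
    """Total number NSC of spatial reuse conflicts in one conflict group."""
    sets = cache_sets(offsets, cls, nset)
    n = len(offsets)
    nsc = 0
    for i in range(n):
        npsc_i = sum(
            potentially_conflicting(i, j, offsets, sets, cls, nset)
            for j in range(n) if j != i
        )
        nsc += max(0, npsc_i - ca + 1)
    return nsc


unpadded = [0, 65536, 131072]   # plane distance 256 * 256 elements
padded = [0, 65792, 131584]     # plane distance 256 * 257 elements
print(spatial_reuse_conflicts(unpadded, cls=4, nset=512, ca=1))
print(spatial_reuse_conflicts(padded, cls=4, nset=512, ca=1))
```

With CLS = 4 and NSET = 512, all three unpadded references fall into the same cache set, whereas a single element of padding spreads them over distinct sets.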

If there are no conflicts, i.e., NSC = 0, we are done and PAD is the recommended
padding for this conflict group with respect to spatial reuse. If the
number of conflicts is reduced relative to the best padding found so far, the
current padding and the number of spatial reuse conflicts associated with it are
stored as the new best solution. As long as there are still conflicts, we try
to solve them with additional padding, i.e., the padding vector PAD is to be
updated. For this purpose, we first identify dimensions that are eligible for padding.
Assigning the index 0 to the outermost dimension and counting upwards, the
minimum padding dimension is determined as MINPADDIM := d + 1, where
d is the outermost dimension with D_i[d] ≠ D_j[d] for any pair of conflicting
array references R_i and R_j. The maximum padding dimension is simply chosen
as MAXPADDIM := |SHP| - 1. Among all eligible dimensions, the outermost
one for which (SHP + PAD)[d] is maximal is chosen. This choice of PADDIM
guarantees that padding overhead grows in minimal steps. Padding is preferably
applied to outer dimensions in order to reduce the negative impact of the loop
overhead introduced by it.

The padding vector PAD is incremented by 1 in dimension PADDIM and, 
assuming this additional padding does not exceed the given limit on mem- 
ory consumption overhead, the cache behaviour is re-evaluated with this new 
padding vector as described so far. Otherwise, PAD is reset to 0 in dimension
MINPADDIM and, provided that MINPADDIM < MAXPADDIM,
padding in the next dimension is increased by 1. The entire process is repeated 
until either all spatial reuse conflicts are eliminated or all padding vectors eligible 
with respect to the memory consumption overhead limit have been investigated. 
In the latter case, the best padding found during the process is stored as recom- 
mended padding. 

With spatial reuse conflicts eliminated as far as possible, we may now focus 
on temporal reuse conflicts. As a first step, we determine for each reference R_i
if there is a chance for temporal reuse from reference R_{i+1} in the presence of
simple cache capacity constraints. This is the case if and only if

$$\mathit{OFFSET}_{i+1} - \mathit{OFFSET}_i < (\mathit{NSET} - 2) * \mathit{CLS} .$$

Note here that all references are sorted with increasing offsets. For each pair
of adjacent references R_i and R_{i+1} which may benefit from temporal reuse, we
then compute the number of potential temporal reuse conflicts NPTC. An array
reference R_j, j ≠ i ∧ j ≠ i + 1, represents a potential temporal reuse conflict if
it is mapped to a cache set "in between" those associated with R_i and R_{i+1}, i.e.

$$\begin{array}{ll}
(\mathit{SET}_i < \mathit{SET}_j) \land (\mathit{SET}_j < \mathit{SET}_{i+1}) & \text{if } \mathit{SET}_i < \mathit{SET}_{i+1} , \\
(\mathit{SET}_i < \mathit{SET}_j) \lor (\mathit{SET}_j < \mathit{SET}_{i+1}) & \text{if } \mathit{SET}_i > \mathit{SET}_{i+1} .
\end{array}$$

In analogy to spatial reuse conflicts, the term "potential" is to be understood
with respect to set associativity, i.e., the number of actual temporal reuse
conflicts NTC is defined as

$$\forall i \in \{1, \ldots, n\} : \mathit{NTC}_i := \max(0, \mathit{NPTC}_i - \mathit{CA} + 1)$$

for each reference and in total as

$$\mathit{NTC} := \sum_{i=1}^{n} \mathit{NTC}_i .$$
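A Python sketch of the temporal reuse analysis is given below. It follows the definitions above for adjacent pairs of references with ascending offsets; the function name is ours, and the treatment of the case SET_i = SET_{i+1}, which the definition leaves open, is an assumption of this sketch (no conflict is counted).

```python
def temporal_reuse_conflicts(offsets, cls, nset, ca):
    """Total number NTC of temporal reuse conflicts; offsets must be ascending."""
    sets = [(off // cls) % nset for off in offsets]
    ntc = 0
    for i in range(len(offsets) - 1):
        # Capacity check: is temporal reuse between R_i and R_(i+1) possible?
        if offsets[i + 1] - offsets[i] >= (nset - 2) * cls:
            continue
        lo, hi = sets[i], sets[i + 1]
        nptc_i = 0
        for j, s in enumerate(sets):
            if j in (i, i + 1):
                continue
            if lo < hi:
                between = lo < s < hi
            elif lo > hi:
                between = lo < s or s < hi   # interval wraps around the cache
            else:
                between = False              # assumption: case left open in the text
            nptc_i += between
        ntc += max(0, nptc_i - ca + 1)
    return ntc


print(temporal_reuse_conflicts([0, 100, 2088], cls=4, nset=512, ca=1))
print(temporal_reuse_conflicts([0, 100, 400], cls=4, nset=512, ca=1))
```

In the first example the third reference is mapped to cache set 10, which lies between the sets 0 and 25 of the first two references, so it interferes with their reuse; in the second example no reference lies in between any adjacent pair.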

Whenever the current padding fails to eliminate all temporal reuse conflicts,
a new padding vector candidate is determined in a similar way as for resolving
spatial reuse conflicts. However, eligible padding dimensions are restricted in
a slightly different way. The minimum eligible padding dimension is defined
as MINPADDIM := d + 1, where d denotes the outermost dimension with
D_i[d] ≠ D_j[d] ≠ D_{i+1}[d] for any triple of conflicting array references R_i, R_j, and
R_{i+1}. The maximum eligible padding dimension MAXPADDIM is given as
the outermost dimension d where D_i[d] ≠ D_{i+1}[d] holds for the same references
R_i and R_{i+1} as above. The basic idea behind these choices for MINPADDIM
and MAXPADDIM is to select a padding dimension which, on the one hand, is
sufficiently large so that the relative cache locations of adjacent references with
potential temporal reuse remain untouched, but, on the other hand, is sufficiently
small, so that padding actually alters the relative cache locations between these
adjacent references and the conflicting reference in between.

In contrast to the choice of a padding dimension for the elimination of spatial 
reuse conflicts, an eligible padding dimension to avoid temporal reuse conflicts
does not necessarily exist. In this case, array padding does not resolve this conflict,
and the inference heuristic stops at this point. Otherwise, a new padding vector 
candidate is chosen exactly as in the context of solving spatial reuse conflicts and 
temporal reuse conflicts are re-evaluated iteratively until either all are eliminated 
or the padding overhead constraint is exhausted. 

An alternative implementation different from the above inference heuristic is 
to evaluate all potential padding vectors eligible with respect to the given con- 
straint on additional memory consumption. For each such padding vector, the 
number of spatial and temporal reuse conflicts as well as the associated over- 
head are computed. Afterwards, the padding vector which causes the minimal 
number of conflicts is selected. If there are several equally suitable padding vec- 
tors, the one which causes the least overhead is chosen. If there are still multiple 
candidates, the one which incurs the least padding in inner dimensions is taken 
eventually. While this alternative implementation is guaranteed to find the most
suitable padding with respect to the number of cache conflicts, memory con- 
sumption overhead, and loop overhead, it generally requires considerably more 
computational effort. However, since this effort is made at compile time rather 
than at runtime, it may be tolerable in many situations. 
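The exhaustive alternative can be sketched compactly. The Python code below is a simplified illustration that, for brevity, counts only spatial reuse conflicts; temporal reuse conflicts would have to be added to the first component of the selection key. The tie-breaking rule for padding in inner dimensions (comparing the padding vector from the innermost dimension outwards) is our interpretation, and all names are hypothetical.

```python
from itertools import product
from math import prod


def addr(vec, shp):
    """Row-major offset of index vector vec for shape shp."""
    return sum(v * prod(shp[k + 1:]) for k, v in enumerate(vec))


def spatial_conflicts(offset_vectors, shp, cls, nset, ca):
    """Number of spatial reuse conflicts of one conflict group for shape shp."""
    offs = [addr(d, shp) for d in sorted(offset_vectors)]
    offs = [o - offs[0] for o in offs]
    sets = [(o // cls) % nset for o in offs]
    total = 0
    for i in range(len(offs)):
        potential = sum(
            1 for j in range(len(offs))
            if j != i
            and (abs(sets[i] - sets[j]) < 2 or abs(sets[i] - sets[j]) == nset - 1)
            and abs(offs[i] - offs[j]) > 2 * cls
        )
        total += max(0, potential - ca + 1)
    return total


def best_padding(offset_vectors, shp, cls, nset, ca, max_overhead):
    """Try every padding vector within the overhead limit and keep the best."""
    limit = int(max_overhead * prod(shp))      # admissible extra elements
    ranges = [range(int(max_overhead * s) + 1) for s in shp]
    best = None
    for pad in product(*ranges):
        padded = tuple(s + p for s, p in zip(shp, pad))
        extra = prod(padded) - prod(shp)
        if extra > limit:
            continue
        # Fewest conflicts first, then least overhead, then least inner padding.
        key = (spatial_conflicts(offset_vectors, padded, cls, nset, ca),
               extra, tuple(reversed(pad)))
        if best is None or key < best[0]:
            best = (key, pad)
    return best[1], best[0][0]


stencil = [(-1, 0, 0), (0, 0, 0), (1, 0, 0)]
result = best_padding(stencil, (256, 256, 256), cls=4, nset=512, ca=1,
                      max_overhead=0.02)
print(result)
```

For this example, a single element of padding in the middle dimension already removes all spatial reuse conflicts; padding the outermost dimension has no effect because its extent does not enter the address computation.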



5 Padding Transformation 

The padding inference algorithm described in the previous section results in
the definition of a function PadType, which for each array type found in the
program or module under consideration yields the recommended padded type.
Types for which a manipulation of the internal data layout is not recommended
are simply returned by PadType as they are. This section focusses on the actual
realization of the padding recommendation, which in the sequel will be formalized
by means of a transformation scheme APT. It defines a high-level source-to-source
transformation on simplified and type-annotated intermediate Sac code.
The former means that nested expressions are lifted to separate assignments to
temporary variables; the latter provides a function Type, which associates each

APT[[ rettypes fun ( args ) { vardecs instrs } Rest ]]
  => APT[[ rettypes ]] fun ( APT[[ args ]] ) {
       RepArgs[[ args ]] APT[[ vardecs ]]
       APT[[ instrs ]]
     } APT[[ Rest ]]

APT[[ type , Rest ]]
  => PadType[[ type ]] , APT[[ Rest ]]

APT[[ type argname , Rest ]]
  => PadType[[ type ]] argname , APT[[ Rest ]]

RepArgs[[ type argname , Rest ]]
  => type _argname ; RepArgs[[ Rest ]]        | ToBePadded[[ type ]]
  => RepArgs[[ Rest ]]                        | otherwise

APT[[ type varname ; Rest ]]
  => PadType[[ type ]] varname ;
     type _varname ; APT[[ Rest ]]            | ToBePadded[[ type ]]
  => type varname ; APT[[ Rest ]]             | otherwise

Fig. 6. Transformation scheme APT on function definitions



variable with a Sac data type. The transformation scheme APT is based on two
additional auxiliary functions: Shape[[ type ]] yields the shape part of an array
data type type as a vector, and ToBePadded[[ type ]] decides whether or not a
padding is recommended for a given type, i.e.

ToBePadded[[ type ]]  ⇔  PadType[[ type ]] ≠ type

Figure 6 shows the effect of the compilation scheme APT on function definitions.
The formal parameters of a function are traversed, and whenever padding
is recommended for a return or argument type, the original type specification is
replaced by the respective padded type. A similar transformation is applied to
the local variable declarations. As already pointed out in Section 4, the
transformation of a padded array into its unpadded representation is necessary in
certain situations, e.g. at module boundaries. Since we do not have any a priori
knowledge as to whether or not such a transformation will actually be required,
additional variable declarations are introduced for each padded original local
variable¹. The same is done for padded formal parameters by means of the
auxiliary compilation scheme RepArgs.

The effect of APT on applications of user-defined and of built-in functions
is defined in Fig. 7. Whereas nothing is to be done in the case of locally defined
functions, the application of an imported function may require a change in the
representations of argument as well as of result arrays. This is described by the
three auxiliary compilation schemes Rename, Pad, and UnPad defined in Fig. 8.



¹ Superfluous variable declarations are eliminated by subsequent optimization steps.

APT[[ vars = fun ( args ) ; Rest ]]
  => vars = fun ( args ) ; APT[[ Rest ]]

APT[[ vars = module:fun ( args ) ; Rest ]]
  => UnPad[[ args ]]
     Rename[[ vars ]] = module:fun ( Rename[[ args ]] ) ;
     Pad[[ vars ]] APT[[ Rest ]]

APT[[ var = dim( array ) ; Rest ]]
  => var = dim( array ) ; APT[[ Rest ]]

APT[[ var = shape( array ) ; Rest ]]
  => var = Shape[[ Type[[ array ]] ]] ; APT[[ Rest ]]    | ToBePadded[[ Type[[ array ]] ]]
  => var = shape( array ) ; APT[[ Rest ]]                | otherwise

APT[[ var = psi( array , vec ) ; Rest ]]
  => var = psi( array , vec ) ; APT[[ Rest ]]

APT[[ var = modarray( array , vec , val ) ; Rest ]]
  => var = modarray( array , vec , val ) ; APT[[ Rest ]]

APT[[ var = reshape( vec , array ) ; Rest ]]
  => UnPad[[ array ]]
     Rename[[ var ]] = reshape( vec , Rename[[ array ]] ) ;
     Pad[[ var ]] APT[[ Rest ]]

Fig. 7. Transformation scheme APT on function applications



TZename[[ var, Rest ]]
  => .var, TZename[[ Rest ]]                  if ToBeVadded[[ Type[[ var ]] ]]
     var, TZename[[ Rest ]]                   otherwise

TZename[[ const, Rest ]]
  => const, TZename[[ Rest ]]

Vad[[ var, Rest ]]
  => var = Pad( .var ); Vad[[ Rest ]]         if ToBeVadded[[ Type[[ var ]] ]]
     Vad[[ Rest ]]                            otherwise

UnVad[[ var, Rest ]]
  => .var = UnPad( var ); UnVad[[ Rest ]]     if ToBeVadded[[ Type[[ var ]] ]]
     UnVad[[ Rest ]]                          otherwise

UnVad[[ const, Rest ]]
  => UnVad[[ Rest ]]

Fig. 8. Auxiliary schemes TZename, Vad, and UnVad
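The treatment of an imported function application can be sketched as a small rewriting function. This is only an illustration in Python, not the compiler's implementation: statements are plain strings, `padded` is an assumed set of variable names whose types are to be padded, and the `_unpadded` suffix is a hypothetical naming convention for the renamed variables.

```python
def transform_imported_call(results, fun, args, padded):
    """Rewrite `results = fun(args);` for an imported function.

    `padded` is the set of variable names whose types are to be padded.
    Names ending in `_unpadded` are hypothetical stand-ins for the renamed
    variables that hold the unpadded representation.
    """
    def rename(name):
        return name + "_unpadded" if name in padded else name

    # Convert padded arguments to their unpadded representation first.
    stmts = [f"{rename(a)} = UnPad({a});" for a in args if a in padded]
    # Call the imported function on renamed arguments and results.
    lhs = ", ".join(rename(r) for r in results)
    rhs = ", ".join(rename(a) for a in args)
    stmts.append(f"{lhs} = {fun}({rhs});")
    # Convert results that are supposed to be padded back again.
    stmts += [f"{r} = Pad({rename(r)});" for r in results if r in padded]
    return stmts
```

Constants and unpadded variables pass through unchanged, so a call without padded operands is left exactly as it was.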
Sac supports only a very limited number of built-in operations on arrays. For
instance, dim and shape retrieve an array’s dimensionality and shape,
respectively. Since padding has no effect on dimensionality, any application of dim
may simply remain as it is. In contrast, an application of shape must be replaced
by the shape corresponding to the original type of the argument array. The function
psi selects the element of array specified by the index vector vec. The offset in
memory specified by vec is computed using the function ADDR(vec, shp) defined
earlier. This function also computes the correct offset of
an array element in a padded array representation, provided that the padded
shape is supplied as second argument. Hence, no code transformation is required
for the selection of elements regardless of whether or not an array is padded. The
built-in function modarray yields an array that is identical to its first argument
except for the element denoted by the second argument, which is replaced by the
third argument. Since Type[[ var ]] = Type[[ array ]] and hence

VadType[[ Type[[ var ]] ]] = VadType[[ Type[[ array ]] ]] ,

modarray can be applied to padded arrays without additional measures. The last
remaining built-in function is reshape, which creates an array that consists of
the same elements as the argument array, but is associated with the new shape
defined by the argument vec. Applications of reshape are restricted to
arguments where the given array’s original shape and the new shape are compatible,
i.e., they refer to arrays with the same number of elements. However, as soon as
one of the two shapes is padded, this restriction is violated. Even if both shapes
are padded, it is rather unlikely that the padded shapes comply with the
compatibility restriction. As a way out, both the argument array and the result
array have to be converted between padded and unpadded representations.
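The two observations about psi and reshape can be checked on a small case. The sketch below is in Python rather than Sac and assumes a row-major offset computation; `addr` and `psi` are stand-ins for ADDR and the built-in selection, and the 2x3 array with padding [0,1] is an invented example.

```python
from itertools import product
from math import prod

def addr(vec, shp):
    """Row-major offset of index vector `vec` in an array of shape `shp`."""
    offset = 0
    for i, n in zip(vec, shp):
        offset = offset * n + i
    return offset

def psi(storage, shp, vec):
    """Select the element at `vec` from flat storage laid out for `shp`."""
    return storage[addr(vec, shp)]

shp, padded_shp = (2, 3), (2, 4)            # padding [0,1]
data = [10, 11, 12, 20, 21, 22]             # unpadded, row-major
padded = [None] * prod(padded_shp)          # None marks dummy elements
for vec in product(*map(range, shp)):
    padded[addr(vec, padded_shp)] = data[addr(vec, shp)]

# Selection needs no change: the same index vector finds the same element
# as long as the padded shape is passed to the offset computation.
same = all(psi(padded, padded_shp, v) == psi(data, shp, v)
           for v in product(*map(range, shp)))

# reshape is different: the element counts of the two shapes disagree.
compatible = prod(shp) == prod(padded_shp)
```

Here `same` holds while `compatible` does not, which is why reshape alone requires a detour through the unpadded representation.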

The transformation of an array from a padded into an unpadded representa- 
tion or vice versa is subject to the three auxiliary compilation schemes TZename, 
Vad, and UnVad defined in Fig. 8. Whenever a padded array is encountered
where an unpadded representation is required, it is transformed by means of a 
predefined generic function UnPad. In a similar way, arrays which are created in 
an unpadded representation for some reason, but whose types are recommended 
to be padded according to VadType, are transformed into the corresponding 
padded representation using the predefined generic function Pad. 
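A minimal model of the two conversion functions, restricted to two-dimensional arrays held as nested Python lists, may help to fix ideas. The names `pad` and `unpad`, their parameters, and the dummy value 0 are assumptions made for this sketch; the generic Pad and UnPad functions of the compiler work on arrays of any dimensionality.

```python
def pad(rows, pad_rows, pad_cols, dummy=0):
    """Return a copy of a 2-D array with dummy elements appended."""
    width = len(rows[0]) + pad_cols
    out = [row + [dummy] * pad_cols for row in rows]
    out += [[dummy] * width for _ in range(pad_rows)]
    return out

def unpad(rows, pad_rows, pad_cols):
    """Strip the dummy elements again, recovering the original array."""
    n_rows = len(rows) - pad_rows
    n_cols = len(rows[0]) - pad_cols
    return [row[:n_cols] for row in rows[:n_rows]]
```

Applying `unpad` after `pad` with the same padding yields the original array, which is the property the transformation relies on at module boundaries.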

Aggregate array operations are defined in one way or another by means of
with-loops in Sac itself. The effect of the compilation scheme AVT on
with-loops is described in Fig. 9. Apart from recursively applying AVT to the
instructions within the body of a with-loop, only a single code transformation is
actually required. The expression that defines the shape of the result array in a
genarray-with-loop is replaced by the corresponding padded shape.

Assuming a generator depends in one way or another on the shape of a
padded array, all applications of the built-in function shape would have been
abstracted out of the generator itself. These applications are then replaced by
the original shapes of the arrays they refer to (see Fig. 7). As a consequence,
array padding does not alter the generators of with-loops in any way. Should
padding apply to the result array of a genarray-with-loop or modarray-with-loop,
the additional dummy elements are automatically initialized according to
the default rule of the with-loop without any additional measures required.

AVT[[ var = with( generator ) { instrs } genarray( shp, val ); Rest ]]
  => var = with( generator ) { AVT[[ instrs ]] }
           genarray( Shape[[ Type[[ var ]] ]], val ); AVT[[ Rest ]]

AVT[[ var = with( generator ) { instrs } modarray( old, iv, val ); Rest ]]
  => var = with( generator ) { AVT[[ instrs ]] }
           modarray( old, iv, val ); AVT[[ Rest ]]

AVT[[ var = with( generator ) { instrs } fold( fun, neutral, val ); Rest ]]
  => var = with( generator ) { AVT[[ instrs ]] }
           fold( fun, neutral, val ); AVT[[ Rest ]]

Fig. 9. Transformation scheme AVT on with-loops

While the padding transformation of with-loops, as outlined in Fig. 9, is
simple and elegant on a conceptual level, it unfortunately introduces superfluous
and avoidable runtime overhead. Initializing dummy array elements according
to the with-loop’s default rule leads to additional memory accesses that, by
definition, do not contribute to the program result. This observation gives rise
to an additional optimization which distinguishes between dummy and regular
array elements in the intermediate representation of with-loops. The internal
format of multi-generator with-loops, as described in [7], provides a suitable
framework for this purpose.
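The behaviour described here can be imitated with a toy model of a genarray with-loop. The Python function below is an assumption-laden sketch, not Sac semantics in full: the generator is taken to be a pair of lower and exclusive upper bounds, and every index outside it receives a default value.

```python
from itertools import product

def genarray(shape, generator, body, default=0):
    """Toy genarray with-loop: indices inside `generator` get body(iv),
    all remaining indices of `shape` get the default value."""
    lower, upper = generator
    result = {}
    for iv in product(*map(range, shape)):
        inside = all(lo <= i < up for lo, i, up in zip(lower, iv, upper))
        result[iv] = body(iv) if inside else default
    return result

orig_shape, padded_shape = (2, 3), (2, 4)
gen = ((0, 0), orig_shape)                  # the generator is left untouched
body = lambda iv: 10 * iv[0] + iv[1]
plain = genarray(orig_shape, gen, body)
padded = genarray(padded_shape, gen, body)  # only the result shape changes
dummies = [iv for iv in padded if iv not in plain]
```

The entries listed in `dummies` are exactly the writes that do not contribute to the program result and that the additional optimization seeks to avoid.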

6 Performance Evaluation 

Figure 10 shows the effect of applying the array padding optimization outlined in
the preceding sections to the PDE1 benchmark. Given the same problem sizes as in
the initial investigations described in Section 2 and the upper limit on memory
consumption overhead set to 10%, the padding inference heuristic decides to
pad 25 out of the total of 33 problem sizes under consideration. In 16 cases,
it recommends a padding of [0,1,0] (32³, 96³, 160³, 224³, 272³, 288³, 304³,
336³, 368³, 400³, 416³, 432³, 464³, 480³, 496³, 528³), in 7 cases a padding
of [0,2,0] (64³, 128³, 192³, 256³, 320³, 384³, 448³), for problem size 352³
a padding of [0,22,0], and for 512³ a padding of [0,5,1] is chosen. Figure 10
depicts the effect of array padding on the simulated cache performance of the
PDE1 benchmark. Array padding succeeds in keeping the L1 cache hit
rate on a consistently high level between 84% and 88% across all problem sizes.
It also manages to avoid the sharp drops in the overall cache hit rate, which can
be observed for the problem sizes 256³ and 512³ in the original figures.
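The memory overhead implied by a padding vector follows directly from the ratio of padded to original element counts. The short Python calculation below applies this to four of the paddings quoted above; the function name is hypothetical and the calculation is ours, not taken from the inference heuristic itself.

```python
from math import prod

def padding_overhead(shape, padding):
    """Relative memory overhead of padding `shape` by `padding` per axis."""
    padded = [s + p for s, p in zip(shape, padding)]
    return prod(padded) / prod(shape) - 1.0

for n, padding in [(32, (0, 1, 0)), (64, (0, 2, 0)),
                   (352, (0, 22, 0)), (512, (0, 5, 1))]:
    overhead = padding_overhead((n, n, n), padding)
    print(f"{n}^3 with {list(padding)}: {overhead:.2%}")
```

All four overheads stay below the 10% limit; the largest of them, for 352³ with [0,22,0], amounts to 6.25%.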

Figure 11 shows the effect of array padding on the runtime performance of
the PDE1 benchmark. First of all, it can be observed that the padding heuristic
does not yield a performance degradation for any of the problem sizes. In contrast,
improvements can be observed whenever the padding transformation actually is
applied, some of them being quite considerable. In particular, for the problem
sizes 64³, 256³, and 512³ the average time needed to re-compute a single grid
element can be reduced by 53%, 64%, and 63%, respectively. Also, the variance
in runtimes is significantly decreased. With array padding, consistent runtimes
are achieved over the whole range of problem sizes investigated.

7 Related Work 

In most functional programming languages, lists rather than arrays are the pre- 
dominantly used data structure. The most prominent exception is the language 
Sisal. However, Sisal represents arrays as vectors of vectors rather than as con- 
tiguous data, and this storage format renders optimizations like array padding 
obsolete. So, we are not aware of any similar optimization technique in the area 
of functional languages. 

In high-performance computing based on imperative languages, still predom- 
inantly Fortran, data locality has long been identified as an important issue 
p3) . Much research has been focussed on program transformations that reorder 
the sequence in which single iterations within a nesting of loops are actually 
executed I5TT9TT21 . Loop transformations such as permutation, reversal, or in- 
terchange, are used to adjust the iteration order to a given array data layout 
in order to achieve unit stride memory accesses in inner loops and, hence, to 
exploit spatial locality. Loop tiling, also called loop blocking, is a combination 
of loop skewing and subsequent loop permutation. It seeks to improve temporal 
locality in loop nestings by reducing the iteration distance between subsequent 
accesses to the same array element mm- Moreover, loop fusion allows to 
exploit locality of reference across multiple adjacent loop nestings m- 
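The effect of loop interchange on memory access strides can be made concrete with a few lines of Python. The sketch assumes a row-major layout as in C (Fortran arrays are column-major, so the roles of the two loop orders are swapped there); the function name and the 3x4 example are invented for illustration.

```python
def strides(order, shape):
    """Distance in elements between consecutive accesses of a row-major
    2-D array when its two loops are nested in the given order."""
    rows, cols = shape
    if order == "ij":          # i outer, j inner: walk along a row
        offsets = [i * cols + j for i in range(rows) for j in range(cols)]
    else:                      # "ji": j outer, i inner: walk down a column
        offsets = [i * cols + j for j in range(cols) for i in range(rows)]
    return [b - a for a, b in zip(offsets, offsets[1:])]
```

With the inner loop running along a row every step has stride one, whereas the interchanged nest jumps by a whole row length on almost every access.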

Often, superior cache performance can be achieved if both the iteration order
and the memory layout are subject to compiler transformations. Examples
are the combination of array transposition with loop permutation [3] or that of
array padding with tiling in order to increase tile sizes and, thus, to reduce the
additional loop overhead inflicted by tiled code. Whereas these approaches
mostly focus on capacity misses, conflict misses due to limited set associativity 
have been identified as another important source of performance degradation 
[12 2 j . Their quantification has been achieved by so-called cache miss equations, 
i.e. linear Diophantine equations, that specify the cache line to which an array 
reference in a loop will be mapped [6]. Due to the complexity and expense of such
accurate investigations, simpler heuristics that address both self-interference as
well as cross-interference cache conflicts in Fortran loop nestings have been
proposed recently [16,17].
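A self-interference conflict of the kind these heuristics target can be reproduced with a very small model. The Python sketch below maps the first element of each row of an array to a set of a direct-mapped cache; the element size, line size, and number of sets are illustrative assumptions and do not describe any particular machine.

```python
def cache_sets(row_len, n_rows, elem_size=8, line_size=32, n_sets=512):
    """Cache sets touched by the first element of each of `n_rows`
    consecutive rows in a direct-mapped cache (all sizes are
    illustrative assumptions; sizes are given in bytes)."""
    return [(r * row_len * elem_size // line_size) % n_sets
            for r in range(n_rows)]

unpadded = cache_sets(256, 16)   # row length is a power of two
padded = cache_sets(257, 16)     # one element of padding per row
```

Under these assumptions the 16 rows of the unpadded array compete for only 8 distinct sets, while a single padding element per row spreads them over 16.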

8 Conclusion 

This paper presents an algorithm that successfully eliminates spatial and
temporal reuse conflicts in Sac programs by implicitly adjusting array data layouts
to access patterns and cache configurations. Cache simulation as well as runtime
performance investigations on the PDE1 benchmark show that this optimization
technique allows for substantial reductions in program runtimes for certain
problem sizes and, moreover, achieves a decidedly more consistent runtime
performance over a wide range of problem sizes.



References 

1. D.F. Bacon, S.L. Graham, and O.J. Sharp. Compiler Transformations for High- 
Performance Computing. ACM Computing Surveys, vol. 26(4), pp. 345-420, 1994. 

2. B. Bershad, D. Lee, T. Romer, and B. Chen. Avoiding Conflict Misses in Large
Direct-Mapped Caches. In Proceedings of the 6th International Conference on
Architectural Support for Programming Languages and Operating Systems
(ASPLOS-VI), San Jose, California, USA, 1994.

3. M. Cierniak and W. Li. Unifying Data and Control Transformations for Distributed
Shared-Memory Machines. In Proceedings of the ACM SIGPLAN Conference on
Programming Language Design and Implementation (PLDI’95), La Jolla, California,
USA, 1995.

4. S. Coleman and K. McKinley. Tile Size Selection Using Cache Organization and 
Data Layout. In Proceedings of the ACM SIGPLAN Conference on Programming 
Language Design and Implementation (PLDI’95), La Jolla, California, USA, pp. 
279-290, 1995. 

5. D. Gannon, W. Jalby, and K. Gallivan. Strategies for Cache and Local Memory 
Management by Global Program Transformation. Journal of Parallel and Dis- 
tributed Computing, vol. 5(5), pp. 587-616, 1988. 

6. S. Ghosh, M. Martonosi, and S. Malik. Cache Miss Equations: A Compiler
Framework for Analyzing and Tuning Memory Behavior. ACM Transactions on
Programming Languages and Systems, vol. 21(4), pp. 703-746, 1999.

7. C. Grelck, D. Kreye, and S.-B. Scholz. On Code Generation for Multi-Generator
WITH-Loops in SAC. In P. Koopman and C. Clack, editors, Proceedings of the 11th
International Workshop on Implementation of Functional Languages (IFL’99),
Lochem, The Netherlands, selected papers. Lecture Notes in Computer Science,
vol. 1868, pp. 77-94. Springer-Verlag, 2000.

8. C. Grelck and S.-B. Scholz. HPF vs. SAC - A Case Study. In A. Bode, T. Ludwig,
W. Karl, and R. Wismüller, editors, Proceedings of the 6th International
Euro-Par Conference on Parallel Processing (Euro-Par’00), Munich, Germany, Lecture
Notes in Computer Science, vol. 1900, pp. 620-624. Springer-Verlag, 2000.

9. J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Ap- 
proach, Second Edition. Morgan Kaufmann, 1995. 

10. M.S. Lam, E.E. Rothberg, and M.E. Wolf. The Cache Performance of Blocked
Algorithms. In Proceedings of the 4th International Conference on Architectural
Support for Programming Languages and Operating Systems (ASPLOS-IV), Palo
Alto, California, USA, pp. 63-74, 1991.

11. N. Manjikian and T.S. Abdelrahman. Fusion of Loops for Parallelism and Locality. 
IEEE Transactions on Parallel and Distributed Systems, vol. 8(2), pp. 193-209, 
1997. 

12. K. McKinley, S. Carr, and C.-W. Tseng. Improving Data Locality with Loop 
Transformations. ACM Transactions on Programming Languages and Systems, 
vol. 18(4), pp. 424-453, 1996. 







13. K. McKinley and O. Temam. A Quantitative Analysis of Loop Nest Locality. In
Proceedings of the 8th International Conference on Architectural Support for
Programming Languages and Operating Systems (ASPLOS-VIII), Boston, Massachusetts,
USA, 1996.

14. T. Mowry, M. Lam, and A. Gupta. Design and Evaluation of a Compiler Algorithm
for Prefetching. In Proceedings of the 5th International Conference on Architectural
Support for Programming Languages and Operating Systems (ASPLOS-V), Boston,
Massachusetts, USA, pp. 62-73, 1992.

15. P.R. Panda, H. Nakamura, N.D. Dutt, and A. Nicolau. A Data Alignment
Technique for Improving Cache Performance. In Proceedings of the International
Conference on Computer Design VLSI in Computers and Processors, Austin, Texas,
USA, pp. 587-592. IEEE Computer Society Press, 1997.

16. G. Rivera and C.-W. Tseng. Data Transformations for Eliminating Conflict Misses.
In Proceedings of the ACM SIGPLAN International Conference on Programming
Language Design and Implementation (PLDI’98), Montreal, Canada, ACM
SIGPLAN Notices, vol. 33(5), pp. 38-49. ACM Press, 1998.

17. G. Rivera and C.-W. Tseng. Eliminating Conflict Misses for High Performance
Architectures. In Proceedings of the ACM International Conference on
Supercomputing (ICS’98), Melbourne, Australia. ACM Press, 1998.

18. G. Rivera and C.-W. Tseng. A Comparison of Compiler Tiling Algorithms. In
Proceedings of the 8th International Conference on Compiler Construction (CC’99),
Amsterdam, The Netherlands, Lecture Notes in Computer Science, vol. 1575, pp.
168-182. Springer-Verlag, 1999.

19. V. Sarkar and R. Thekkath. A General Framework for Iteration-Reordering Loop 
Transformations. In Proceedings of the ACM SIGPLAN Conference on Program- 
ming Language Design and Implementation (PLDI’92), San Francisco, California, 
USA, pp. 175-187, 1992. 

20. S.-B. Scholz. On Defining Application-Specific High-Level Array Operations by
Means of Shape-Invariant Programming Facilities. In S. Picchi and M. Micocci,
editors, Proceedings of the International Conference on Array Processing Languages
(APL’98), Rome, Italy, pp. 40-45. ACM Press, 1998.

21. S.-B. Scholz. A Case Study: Effects of WITH-Loop Folding on the NAS
Benchmark MG in SAC. In K. Hammond, T. Davie, and C. Clack, editors, Proceedings
of the 10th International Workshop on Implementation of Functional Languages
(IFL’98), London, UK, selected papers. Lecture Notes in Computer Science, vol.
1595, pp. 216-228. Springer-Verlag, 1999.

22. O. Temam, C. Fricker, and W. Jalby. Cache Interference Phenomena. In
Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of
Computer Systems, Nashville, Tennessee, USA, pp. 261-271. ACM Press, 1994.

23. M. E. Wolf and M. S. Lam. A Data Locality Optimizing Algorithm. In
Proceedings of the ACM SIGPLAN Conference on Programming Language Design and
Implementation (PLDI’91), pp. 30-44, 1991.




The Collective Semantics 
in Functional SPMD Programming 



John O’Donnell 

Computing Science Department, University of Glasgow, 
Glasgow G12 8QQ, UK 
jtod@dcs.gla.ac.uk
http://www.dcs.gla.ac.uk/~jtod/



Abstract. SPMD programs are usually written from the perspective of 
a single processor, yet the intended behaviour is an aggregate compu- 
tation comprising many processors running the same program on local 
data. Combinators, such as map, fold, scan and multibroadcast, provide 
a flexible way to express SPMD programs more naturally and more ab- 
stractly at the collective level. A good SPMD programming methodology 
begins with a specification at the collective level, where many significant
transformations and optimisations can be introduced. Eventually,
however, this collective level program must be transformed to the indi- 
vidual level in order to make it executable on an SPMD system. This 
paper introduces a technique needed to make the transformation pos- 
sible within a formal framework: a special collective semantics for the 
individual level program is required in order to justify a transformation 
from the collective level to the individual level. The collective semantics 
defines the meanings of the collective communication operations, and it 
allows equational reasoning to be used for deriving and implementing 
SPMD programs. 



1 Introduction 

Two popular methods for writing parallel programs are expressing parallelism 
using combinators (such as map and scan), and SPMD programming (which is
commonly supported by commercially available parallel systems). These meth- 
ods, which will be described shortly, offer complementary advantages. 

It would be helpful to be able to use both styles in constructing parallel 
applications. The programmer could specify an algorithm at a high level using 
parallel combinators, and a variety of effective methods exist for improving such 
algorithms via program transformation. In order to make it executable on a 
real parallel machine, the program could then be transformed to the lower level 
SPMD style. This transformation might be performed either by a compiler or 
by the programmer, using formal equational reasoning. 

This paper identifies a difficulty, called the collective/individual equivalence
problem, which arises while transforming a parallel combinator program into 
an SPMD program, and it sketches an approach for solving the problem. The 
difficulty, in a nutshell, is that part of the meaning of the SPMD program is
defined implicitly by a nonstandard semantics, and this must be taken into 
account in order to establish the equivalence of the high and low level versions 
of the algorithm. The purpose of the paper is to point out the problem and 
an approach to solving it, but a complete and precise implementation of the 
proposed solution remains as future work. 

In Section 2 the parallel combinator and SPMD programming styles are
described, and the collective/individual equivalence problem is discussed. Section 3
then introduces a solution to the problem, using a restricted setting (purely local
computation) to keep things simple. We then consider communication
operations and a more realistic monadic coordination language, before
concluding.

2 Two Styles of Parallel Programming 

Parallel programming languages, for both the parallel combinator style and the 
SPMD style, consist of two parts: 

— A set of parallel operations. For high level programming with parallel com- 
binators, every parallel operation is expressed directly as a combinator. For 
SPMD programming, different mechanisms are used, depending on whether 
the parallel operation is purely local or uses interprocessor communication. 

— A coordination language, which expresses the algorithm as a sequence of 
parallel operations. For a functional language with parallel combinators, the 
coordination language could comprise the entire functional language (e.g. 
Haskell), but it could also be restricted to functions written in a particular 
form. For conventional SPMD programming, the coordination language is C 
or Fortran. 

We will first describe the combinator and SPMD styles in more detail, and
then discuss the collective/individual equivalence problem. 



2.1 Parallel Combinators 

The combinator method for expressing parallelism is well suited for abstract, 
high level specifications of algorithms. A family of functions, such as map, fold 
and scan, is used to express parallel computations. For example, a set of par- 
allel local computations using the same function f can be expressed as pmap 
f xs, where the data structure xs is distributed among the memories of the 
parallel processors, and the individual function applications f xi, where xi is 
one of the elements of xs, are executed simultaneously in different processors. 
This style of programming is convenient for many applications. There is a rich 
set of mathematical laws relating the combinators, making this approach well 
suited for formal reasoning as well as a variety of optimisations and program 
transformations. 
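As a small illustration of both the combinator and one of its laws, the following Python sketch models pmap on an ordinary list with one element per processor. The sequential list comprehension, the helper `compose`, and the sample functions are assumptions of the sketch; only the independence of the individual applications is essential.

```python
def pmap(f, xs):
    """Model of parallel map: xs holds one element per processor, and each
    application f(x) is independent of all the others."""
    return [f(x) for x in xs]

def compose(f, g):
    """Function composition: compose(f, g)(x) == f(g(x))."""
    return lambda x: f(g(x))

xs = [1, 2, 3]
inc = lambda x: x + 1
dbl = lambda x: x * 2

two_passes = pmap(inc, pmap(dbl, xs))        # two parallel steps
one_pass = pmap(compose(inc, dbl), xs)       # fused into a single step
```

Both expressions yield the same distributed value, an instance of the map fusion law that makes this style convenient for program transformation.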
Fig. 1. Distributed memory machine



2.2 SPMD Programs 

SPMD (single program, multiple data) is another popular model for program- 
ming parallel computers |1I6| . Many commercially available parallel systems sup- 
port the SPMD style, and it offers relatively good program portability. The idea 
behind SPMD is simply that the programmer writes a program that will run 
on one processor, but the parallel operating system executes the program by 
loading a copy into all the processors, which then execute it concurrently. The 
term SPMD is apt because the processors all use the same program, but they 
normally have different data in their local memories. 

SPMD programs are usually written in an imperative language, such as C 
or Fortran, using a standard communications library, such as MPI jSI7lj . The 
resulting language is often referred to as C+MPI. The most characteristic
attribute of conventional SPMD programming style is that the program is written
from the viewpoint of a single processor, yet the programmer must bear in mind 
that there are actually many. For example, suppose that the aim is to perform 
a computation on each element of a vector, expressed as a sequential iteration: 

for (i=0; i<n; i++)
  { y[i] = sqrt(x[i]); }

This program fragment requires x and y to be declared as arrays, and it would 
be executed sequentially. To introduce SPMD parallelism, we can distribute the 
arrays across the processors, so that each processor Pi computes y[i] using the 
value of x[i] in its local memory. To express this, however, we don’t use arrays
at all; instead, x and y are declared as singletons, and the parallel iteration is 
expressed using an ordinary assignment statement: 

y = sqrt(x);
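The individual-level view can be mimicked in a few lines of Python. In this sketch each processor's local memory is a dictionary, and `run_spmd` is a hypothetical stand-in for the parallel operating system that simply visits the processors in turn; a real system would execute the copies concurrently.

```python
import math

def program(local):
    """The SPMD program as seen by one processor: `local` is that
    processor's memory, holding singletons rather than arrays."""
    local["y"] = math.sqrt(local["x"])

def run_spmd(program, memories):
    """Stand-in for the parallel system: run the same program on every
    processor's local memory (sequentially here, for simplicity)."""
    for local in memories:
        program(local)
    return memories

memories = run_spmd(program, [{"x": 1.0}, {"x": 4.0}, {"x": 9.0}])
```

The program text mentions only the singletons x and y, yet the combined effect over all memories is that of the sequential loop over the arrays.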

Interprocessor communication is generally achieved through collective com- 
munication operations, which are defined in a library such as MPI. Typical 
examples are fold (also called reduce) and scan (also called multiprefix). Con- 
ceptually, collective communication operations make sense only at the collec- 
tive level, since they inherently require information from all the processors. In 
a C+MPI program, however, they must be expressed at the individual level.
Therefore a collective communication operation is performed by arranging for
each processor to call the same system library procedure, providing a local
singleton argument. The parallel computer’s interconnection network (Figure 1)
hardware and software perform the operation, and they return a singleton value
to each processor.
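Viewed from the collective level, such operations are ordinary functions from one value per processor to one value per processor. The Python sketch below models fold and scan in this way; the function names are hypothetical, the scan is assumed to be inclusive, and no actual communication library is involved.

```python
from functools import reduce
from itertools import accumulate
import operator

def collective_fold(op, locals_):
    """Every processor contributes one value and receives the total."""
    total = reduce(op, locals_)
    return [total for _ in locals_]

def collective_scan(op, locals_):
    """Processor i receives the fold of the values of processors 0..i."""
    return list(accumulate(locals_, op))

sums = collective_fold(operator.add, [1, 2, 3, 4])
prefixes = collective_scan(operator.add, [1, 2, 3, 4])
```

Each position of the argument list plays the role of the singleton supplied by one processor, and the same position of the result is the singleton returned to it.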

SPMD programs can be written in a functional language as well as in C, by 
treating the program as a sequence of operations. The standard MPI procedures 
can be called directly from a Haskell program, using a foreign language interface 

0 . 

2.3 Compilation via Program Transformation 

One way to proceed would be to define the collective communication operations 
as parallel functions, perhaps using Glasgow Parallel Haskell and algorithmic 
strategies 0. However, that does not lead to an SPMD program. The aim here 
is different: we want to specify the program abstractly with combinators at the 
collective level, and then to transform it down to the individual level for parallel 
execution. 

Most people who use SPMD systems simply write the final code in C+MPI.
The drawback of SPMD programming at the individual level is that it hinders 
reasoning about programs. Methodologies for deriving high performance pro- 
grams, such as TwoL 0 and Abstract Parallel Machines [3], require the ability 
to express algorithms abstractly and to transform them. The programmer needs 
to think at the collective level, not the individual level. Furthermore, there is a 
rich algebra for the standard collective communication operations, particularly 
the family of map, fold and scan functions, which have proven effective for a 
variety of program transformations and optimisations. These algebraic laws op- 
erate at the collective level, and unfortunately cannot be applied to conventional 
SPMD programs at the individual level. Merely expressing the low level program 
in Haskell rather than C is not enough; we need a way of writing collective level 
programs and transforming them down to the individual level. 

Compilation is the transformation of a program written in a high level no- 
tation into an executable low level form. This is usually viewed as a completely 
automatic “black box” process. An alternative view — compilation as a sequence 
of program transformations — offers a range of potential benefits. The ghc com- 
piler for Haskell consists of a sequence of program transformations, using typed 
intermediate languages. This organisation makes it easier to incorporate pro- 
gram analysis and optimisation techniques, which are often expressed as trans- 
formations. It also reduces the gap between the structure of a compiler and the 
framework that would be needed to carry out a correctness proof. Even without 
doing such a detailed proof, it is arguably better to organise a large software 
project in a way that links more strongly with semantics. 

Fully automatic compilers are limited to optimisations that can be discovered 
and proved correct by algorithms. Yet many useful optimisations can be discov- 
ered by clever programmers, but are too subtle for compilers to find. This may 
be one reason for the continued popularity of low level languages like Fortran for 
applications that require high performance: the more abstract languages cannot
offer equally good performance, because their compilers are not always able to 
find the most efficient way to implement the algorithm. 

A potential solution to these problems is to combine the programmer’s in- 
sight into how an algorithm could run efficiently with the ability of software 
tools to support straightforward transformations. In effect, this approach would 
lead to compilers even more open than ghc, where the results obtained by auto- 
matic compilation could be used when they are good enough, and they could be 
improved by the programmer when he or she has an idea for an effective opti- 
misation and the extra work seems justified by the need for better performance. 

2.4 The Equivalence Problem 

A serious technical problem arises when transforming a parallel combinator pro- 
gram into its SPMD equivalent. The difficulty is that the two versions of the 
algorithm are essentially the same — they will yield the same results — yet math- 
ematically they are not at all the same! For example, the parallel computation 
expressed by pmap f xs would be transformed into f x in the SPMD version, 
yet the equivalence equation 

pmap f xs = f x

is simply untrue. 

The essence of the problem is that the collective level program defines the 
behaviour explicitly, while an SPMD program leaves part of the behaviour im- 
plicit. Consequently there is no way to transform a program from the collective 
level to the individual level, using correctness preserving transformations and 
equational reasoning. 

We propose a solution to the problem, comprising three components: 

1. A special combinator-to-SPMD transformation rule is introduced. This rule 
is used exactly one time when transforming the high level algorithm to the 
target executable program, and that step is the one where the algorithm 
changes its view from the collective level to the individual level. 

2. A new nonstandard collective semantics is introduced; this gives the meaning 
that an SPMD program will have when run in the intended way, on an SPMD 
system. For example it defines the meaning of a singleton computation f x
to be map f xs, provided that x is a component of an aggregate xs that is
distributed across the processors. 

3. The soundness of the combinator-to-SPMD transformation is established 
formally, through equational reasoning, using both the standard and the 
nonstandard semantics. The soundness theorem states that the collective 
semantics applied to the SPMD program gives the same function as the 
standard semantics applied to the combinator program. 

The remainder of this paper sketches how these components work, but full def- 
initions of the transformation rule and the collective semantics, and a proof of 
the soundness theorem, remain as future work. 
It is worth noting that the problem identified here results from combining
the techniques of combinator parallelism, SPMD programming, and formal
program transformation via equational reasoning. A programming methodology 
that omits any of these elements will not run into the difficulty at all: 

— Parallel applications are often written directly in C+MPI. A good long term
aim for parallel functional programming researchers would be to give con- 
vincing evidence that better results can be obtained by starting with com- 
binator parallelism. One way to show that would be to compile the parallel 
combinator program to SPMD form automatically, thus saving the program- 
mer’s time. Another way would be to improve the application’s efficiency 
using program transformations at the combinator level. 

— Formalisms such as algorithmic skeletons and BMF are unable to express 
SPMD algorithms in the first place, so issues about transforming skele- 
ton/BMF programs into SPMD do not arise. The only way to run a skele- 
ton/BSP program on an SPMD system is to translate it to machine code in 
a black box compiler, while the purpose of this paper is to show how such 
programs can be transformed into the machine code in a sequence of steps 
justified by equational reasoning. 

Semantic definitions provide the formal justification for program transfor- 
mations. Ordinarily, we use the standard semantics S, where S [[ P ] gives the 
meaning of the program P. One program can be transformed to another if they 
both have the same meaning under the standard semantics: 

Definition 1 . // S | Pi ] = S | P2 1, then Pi is transformable to P 2 under the 
standard semantics. This is written as P\ P2. 

A transformation from a program $P_1$ expressed at the collective level to the
corresponding program $P_2$ expressed at the individual level must be treated
differently, since $S[\![P_1]\!] \neq S[\![P_2]\!]$. First we introduce a nonstandard collective
semantics $C$ that gives the “real” meaning of the SPMD program. That is, $C[\![P]\!]$
is the function computed by the entire SPMD system when each of its processors
is executing a copy of $P$. Note that $S[\![P]\!]$ is the behaviour of an individual
processor. Now we can introduce a special transformation that enables us to
move from the collective level to the individual level:

Definition 2. If $S[\![P_1]\!] = C[\![P_2]\!]$, then $P_1$ is transformable to $P_2$ under the
collective semantics. This is written as $P_1 \xrightarrow{C} P_2$.

In a formal methodology for deriving parallel programs, we would start with
an abstract specification at the collective level, and perform a sequence of
ordinary transformations using $\to$. Such transformations could improve
performance, introduce decisions about how to organise the computation, bring
the program closer to a low level implementation, and so on. When we need to
make the jump to the individual level, we use $\xrightarrow{C}$, which can be
used only one time in a derivation. Further ordinary transformations can then
be applied to the individual level program, perhaps to introduce low-level
optimisations.
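
The overall shape of such a derivation can be summarised schematically. In this sketch, $P_0, \ldots, P_k$ name the collective level programs and $Q_0, \ldots, Q_m$ the individual level programs; these names, and the arrows $\to$ (an ordinary step in the sense of Definition 1) and $\xrightarrow{C}$ (the single collective step of Definition 2), are notational assumptions made here for illustration only.

```latex
\[
P_0 \to P_1 \to \cdots \to P_k \;\xrightarrow{C}\; Q_0 \to Q_1 \to \cdots \to Q_m
\]
\[
S[\![P_0]\!] = \cdots = S[\![P_k]\!] = C[\![Q_0]\!],
\qquad
S[\![Q_0]\!] = \cdots = S[\![Q_m]\!]
\]
```

The first chain of equalities follows from Definitions 1 and 2 applied to the steps up to and including the jump; the second follows from Definition 1 applied to the individual level steps.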
3 Local Computation with Compositional Coordination

In this section we consider the simplest case, where the only parallel operation 
is local computation, corresponding to parallel map, and the coordination lan- 
guage is pure function composition. This simplification is unrealistic for most 
applications, but it clarifies the distinction between parallel combinators and 
SPMD, and is simple enough for a fully detailed presentation here. 

Let a be the type of the state of an individual processor. The state of the 
entire system is modelled by a value of type [a] . This is not an arbitrary Haskell 
list; the list must have exactly one element for each processor, and there is no 
concept of sharing of list elements. Strictly speaking, we should use a special
finite sequence data type, but it is convenient to use lists so that standard func- 
tions like map can be applied. It is not intended that the list will be represented 
using the standard box and pointer data structure, and many of the properties 
of Haskell lists, such as sharing and infinite lists, will also be avoided. 

The only parallel operation allowed is one where every processor applies the 
same function f to its state, resulting in a new state. The behaviour of the entire
system is then map f xs, where xs is the list of processor states. 

The coordination language is the simplest possible, pure function composi- 
tion. The parallel program thus consists of a sequence of state transition func- 
tions to be executed. Such a program is expressed as the forwards composition 
of the operations. The program will look more natural when expressed with the 
forward composition operator (@): 

(@) :: (a -> b) -> (b -> c) -> a -> c
(f @ g) x = g (f x)

In general, the type of the state could change with each operation, but let 
us assume here that each processor has the same state type a, and that each 
operation leaves the state type unchanged. 

We call this level of abstraction, where the program specifies the
computation of the entire aggregate, the collective level. Consider a parallel
program pc, expressed at the collective level, which specifies that all the
processors apply f1 to their states in parallel, followed by parallel
applications of f2 and f3. This program has the form:

pc :: [a] -> [a]
pc = map f1 @ map f2 @ map f3

A program that describes what one individual processor does is said to be 
expressed at the individual level. The actual code that runs on the parallel com- 
puter must be expressed at the individual level, and the aim of our transfor- 
mational methodology is to translate the abstract specification (collective level) 
into this executable form (individual level). The program pc corresponds to the 
individual level program pi: 

pi :: a -> a
pi = f1 @ f2 @ f3
Thus for programs that perform purely local computations, the only
difference between the collective and individual levels is a map that expresses
the fact that many processors are doing the same thing.
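
The relationship between the two levels can be checked on a small instance. The following is a minimal sketch under stated assumptions: the operator name `(|>)` stands in for the forward composition operator (the symbol `@` is reserved in Haskell source), the three local functions are hypothetical, and the individual level program is called `pInd` because `pi` is already defined in the Prelude.

```haskell
-- Forward composition; a stand-in for the operator written (@) in the text.
infixl 1 |>
(|>) :: (a -> b) -> (b -> c) -> a -> c
(f |> g) x = g (f x)

-- Hypothetical local operations on an Int processor state.
f1, f2, f3 :: Int -> Int
f1 = (+ 1)
f2 = (* 2)
f3 = subtract 3

-- Collective level: the list has one element per processor.
pc :: [Int] -> [Int]
pc = map f1 |> map f2 |> map f3

-- Individual level: what a single processor executes.
pInd :: Int -> Int
pInd = f1 |> f2 |> f3
```

For a three-processor state, `pc [0, 1, 2]` and `map pInd [0, 1, 2]` both evaluate to `[-1, 1, 3]`.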

3.1 Transformation 

As the previous example suggests, all we have to do in translating the pro- 
gram is to remove the maps. (Naturally, the translation is more complex when 
communication operations are introduced, or when the coordination language is 
enhanced.) The program transformation is defined as follows: 

Definition 3. Let $P$ be a program expressed at the collective level, within the
pure function composition coordination language. Then the collective to
individual transformation $T[\![P]\!]$ is defined by

$$T[\![f_1 \mathbin{@} f_2]\!] = T[\![f_1]\!] \mathbin{@} T[\![f_2]\!]$$

$$T[\![\mathrm{map}\ f]\!] = f$$

If $T[\![P_1]\!] = P_2$, we write $P_1 \xrightarrow{C} P_2$.

3.2 Collective Semantics 

The meaning of the collective level program is just given by its standard se- 
mantics. If it is written in Haskell, we can just compile it and the compiler 
will produce code that computes the right value. However, the meaning of the 
individual level program P running on an SPMD system is not given by the 
standard semantics. Therefore we need to define a special collective semantics C 
that gives the real (collective) meaning of P. This is defined as follows: 

Definition 4. Let $P$ be a program expressed at the individual level, within the
pure function composition coordination language. Then the collective semantics
$C[\![P]\!]$ is defined by

$$C[\![f_1 \mathbin{@} f_2]\!] = C[\![f_1]\!] \mathbin{@} C[\![f_2]\!]$$

$$C[\![f]\!] = \mathrm{map}\ (S[\![f]\!])$$

Again, the simplicity of this definition results from the lack of communication
operations. Note that if P :: a -> a, then $C[\![P]\!]$ :: [a] -> [a].

3.3 Soundness 

Clearly a collective level program pc and its individual level translation pi rep- 
resent different mathematical functions, and they have different types. However, 
when pi is executed on the processors of an SPMD system, it behaves just the 
same as pc. The collective to individual transformation is said to be sound if the 
individual level program has — on an SPMD system — the same behaviour as the 
collective level program using the standard semantics. The following straightfor- 
ward theorem states that T, as defined above for the pure composition coordi- 
nation language, is sound. 
Theorem 1. Let

f1, ..., fn :: a -> a

pc = map f1 @ ... @ map fn

such that pc is well typed. Then

$$C[\![T[\![\mathit{pc}]\!]]\!] = S[\![\mathit{pc}]\!].$$

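Proof. Structural induction over the program text. For the base case, let f be a
piece of program text denoting a basic function; then

$$
\begin{aligned}
C[\![T[\![\mathrm{map}\ f]\!]]\!]
  &= C[\![f]\!] \\
  &= \mathrm{map}\ (S[\![f]\!]) \\
  &= S[\![\mathrm{map}\ f]\!]
\end{aligned}
$$

In the inductive case, assume the theorem holds for f and g. Then

$$
\begin{aligned}
C[\![T[\![f \mathbin{@} g]\!]]\!]
  &= C[\![T[\![f]\!] \mathbin{@} T[\![g]\!]]\!] \\
  &= C[\![T[\![f]\!]]\!] \mathbin{@} C[\![T[\![g]\!]]\!] \\
  &= S[\![f]\!] \mathbin{@} S[\![g]\!] \\
  &= S[\![f \mathbin{@} g]\!]
\end{aligned}
$$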

What this theorem gives us is a formal justification for performing the col- 
lective to individual level transformation. Without the theorem, it is impossible 
to use ordinary equational reasoning to make that step. 
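
The definitions of T and C, and one instance of the soundness theorem, can be made executable with a small deep embedding. This is a sketch under assumptions that are not part of the paper: program text is represented by the hypothetical data types `Coll` and `Ind`, basic functions are embedded directly as Haskell functions (so a basic function denotes itself), and all names (`semS`, `semSInd`, `transT`, `semC`, `soundOn`, `example`) are chosen here for illustration.

```haskell
-- Program text at the collective level: compositions of parallel maps.
data Coll a
  = CMap (a -> a)            -- map f
  | CSeq (Coll a) (Coll a)   -- p followed by q (forward composition)

-- Program text at the individual level: compositions of basic functions.
data Ind a
  = IFun (a -> a)            -- f
  | ISeq (Ind a) (Ind a)     -- p followed by q

-- Standard semantics S of a collective level program.
semS :: Coll a -> [a] -> [a]
semS (CMap f)   = map f
semS (CSeq p q) = semS q . semS p

-- Standard semantics S of an individual level program.
semSInd :: Ind a -> a -> a
semSInd (IFun f)   = f
semSInd (ISeq p q) = semSInd q . semSInd p

-- Transformation T: remove the maps.
transT :: Coll a -> Ind a
transT (CMap f)   = IFun f
transT (CSeq p q) = ISeq (transT p) (transT q)

-- Collective semantics C: every processor runs the same individual program.
semC :: Ind a -> [a] -> [a]
semC (IFun f)   = map (semSInd (IFun f))
semC (ISeq p q) = semC q . semC p

-- Hypothetical example: map (+1), then map (*2), then map (subtract 3).
example :: Coll Int
example = CSeq (CSeq (CMap (+ 1)) (CMap (* 2))) (CMap (subtract 3))

-- The statement of the theorem, checked for one program on one input.
soundOn :: Coll Int -> [Int] -> Bool
soundOn p xs = semC (transT p) xs == semS p xs
```

Evaluating `soundOn example [0 .. 9]` checks the equation of the theorem on a single input; it is a test of one instance, not a substitute for the inductive proof.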

4 Collective Communications and Monadic Coordination 

Although the language presented in the previous section is suitable for a few 
simple pipelining problems, it is inadequate for nearly all practical parallel pro- 
gramming. A crucial limitation is that there is no provision for interprocessor 
communication, and a less severe problem is that pure function composition is 
inconvenient as a coordination language. Both of those problems are addressed 
in this section: collective communication operations are incorporated, and a 
monadic coordination language is provided. Again, we need a transformation 
from the collective to the individual level, a collective semantics for the individ- 
ual level, and a soundness theorem. These will only be sketched in this paper, 
and a complete specification and proof are left for future work. 

4.1 Collective Communication Operations 

A collective communication operation involves all the processors in the system. 
A typical example is a scatter operation, in which one processor produces a
vector of values $[x_0, x_1, \ldots, x_{P-1}]$, where there are $P$ processors; the result of







John O’Donnell 



the scatter operation is that processor $p_i$ receives $x_i$. A similar (but opposite)
operation is gather, where one processor builds a vector in its local memory, 
where the ith element is provided by processor i. 
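
In the list model of the system state used in this paper, scatter and gather can be given simple specifications. The following is a hypothetical sketch, not a definition taken from the paper: the names `scatter` and `gather`, the use of a root index, and the use of `Maybe` for processors that receive nothing are all assumptions.

```haskell
-- Scatter: every processor holds a (possibly empty) vector, but only the
-- root's vector matters; afterwards processor i holds element i of it.
-- The root's vector is assumed to have one element per processor.
scatter :: Int -> [[a]] -> [a]
scatter root vectors = vectors !! root

-- Gather: processor i contributes its value; the root ends up with the
-- whole vector and every other processor receives nothing.
gather :: Int -> [a] -> [Maybe [a]]
gather root xs =
  [ if i == root then Just xs else Nothing | i <- [0 .. length xs - 1] ]
```

For example, `scatter 1 [[], [10, 20, 30], []]` yields `[10, 20, 30]`, and `gather 0 "abc"` yields `[Just "abc", Nothing, Nothing]`.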

The essential point about collective communications is that all the proces- 
sors must participate in it at the same time, and they must all agree on what 
operation is being performed. This is quite different from random point-to-point 
message passing, where each processor can do what it likes, when it likes. Col- 
lective communications fit well with SPMD programming, since the constraint 
that every processor is running the same program makes it feasible to ensure 
that the processors stay synchronised sufficiently to ensure that they all agree 
about what communications operation to perform next. SIMD systems also pro- 
vide collective communications, since the hardware requires every processor to 
be executing the same instruction as all the others. However, the SPMD style 
allows the simplicity of the data parallel programming model, while still allowing 
the processors to execute different instructions. 

A particularly interesting class of collective communications involves
computation as well as communication. For example, the fold (or reduce) operation
requires every processor to send a data value into the network, which must then
use a binary operator f to combine pairs of values, eventually producing a
singleton result.

In this section we consider just three collective communication operations, 
which are sufficient to illustrate how the collective to individual transformation 
works in the presence of communications. These operations are pfold, piscanl, 
and piscanr. 

The parallel fold function pfold combines the elements of a list using a
function f. In order to allow a log time parallel implementation, f is assumed
to be associative; for this reason its type is (a -> a -> a) rather than the more
general types permitted for sequential folds.

pfold :: (a -> a -> a) -> [a] -> a

The intention is that a parallel system will implement pfold in log time using 
a tree-structured computation. One way to achieve this is to write a conventional 
compiler that translates a pfold application into the necessary machine code. 

Our purpose here, however, is to investigate how to transform the high level
program into a low level SPMD program expressed in the same functional
language. At the high level, we specify the semantics of pfold using the standard
foldl1 function; this suffices for rapid prototyping as well as formal
reasoning. Thus the abstract definition of pfold should be viewed as a mathematical
definition, which enables the high level program to be executed directly.

pfold = foldl1
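
The tree-structured evaluation that a parallel system might use can itself be modelled in Haskell. The sketch below is an illustration only, with hypothetical names `combinePairs` and `treeFold`; it combines adjacent pairs in rounds and counts the rounds, which models the logarithmic depth of the computation but does not, of course, run in parallel.

```haskell
-- Combine adjacent pairs; an odd element at the end is carried over unchanged.
combinePairs :: (a -> a -> a) -> [a] -> [a]
combinePairs f (x : y : rest) = f x y : combinePairs f rest
combinePairs _ xs             = xs

-- Tree-structured fold of a non-empty list, also returning the number of
-- rounds of pairwise combination that were needed.
treeFold :: (a -> a -> a) -> [a] -> (a, Int)
treeFold _ []  = error "treeFold: empty list"
treeFold f xs0 = go xs0 0
  where
    go [x] rounds = (x, rounds)
    go xs  rounds = go (combinePairs f xs) (rounds + 1)
```

When f is associative, the first component agrees with `foldl1 f`; for instance `treeFold (+) [1 .. 8]` gives `(36, 3)`, three rounds for eight elements. Without associativity the two results can differ, which is why the restriction on f is needed.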

Scan is a generalisation of fold: when f is scanned over a list, pfold gives
the final result and scan produces a list of all the intermediate results. Again,
the function f is assumed to be associative. A list can be scanned from the
left in parallel (piscanl, giving the intermediate results of foldl) or from the
right (piscanr, giving the intermediate results of foldr). These are both
inclusive scans; i.e. the first element of their result includes the first element
of the argument, and there is no singleton accumulator argument.

piscanl, piscanr :: (a -> a -> a) -> [a] -> [a]
piscanl f xs
  = [foldl1 f (take (i+1) xs)
       | i <- [0 .. length xs - 1]]
piscanr f xs
  = [foldr1 f (drop i xs)
       | i <- [0 .. length xs - 1]]

The mathematical specifications of these functions are expressed with list
comprehensions, since experience has shown this to give a clear and direct
specification suitable for equational reasoning. The specifications are also
executable, so they can be used for rapid prototyping. However, they are
inefficient, as they require $O(n^2)$ time to scan a list of length $n$, while
there are sequential accumulator-style definitions that require only $O(n)$
time, and the parallel implementation is $O(\log n)$. Nevertheless, efficiency is
less important than clarity for semantic specifications.
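
As an illustration of the linear-time alternative, the Prelude functions `scanl1` and `scanr1` are inclusive, accumulator-style scans that compute the same lists as the specifications. The sketch below restates the specifications so that it is self-contained; the names `piscanlLinear` and `piscanrLinear` are introduced here and are not part of the paper.

```haskell
-- Specifications from the text (quadratic time).
piscanl, piscanr :: (a -> a -> a) -> [a] -> [a]
piscanl f xs = [foldl1 f (take (i + 1) xs) | i <- [0 .. length xs - 1]]
piscanr f xs = [foldr1 f (drop i xs)       | i <- [0 .. length xs - 1]]

-- Sequential accumulator-style versions (linear time): the Prelude's
-- inclusive scans.
piscanlLinear, piscanrLinear :: (a -> a -> a) -> [a] -> [a]
piscanlLinear = scanl1
piscanrLinear = scanr1
```

For example, `piscanl (+) [1, 2, 3, 4]` and `piscanlLinear (+) [1, 2, 3, 4]` both give `[1, 3, 6, 10]`, and `piscanr max [3, 1, 2]` and `piscanrLinear max [3, 1, 2]` both give `[3, 2, 2]`.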

4.2 A Monadic Coordination Language

Practical parallel programs use many variables, and the collective communication 
operations will typically access and update just a few of these. It is possible, but 
awkward, to express such programs using function composition to coordinate the 
operations. To do this, we would need to define two helper functions for every 
variable, one to access the processor state and fetch the variable’s value, and the 
other to update the state with a new value for that variable. 

A much more convenient approach is to use a monadic coordination
language. This enables us to bind a name to the result returned by an operation,
it can handle the state transitions, and it can also provide Input/Output
operations. Monadic collective communication operations for the pfold, piscanl
and piscanr functions can be defined as IO operations that return the result
specified by the corresponding function:

opfold :: (a -> a -> a) -> [a] -> IO a
opfold f xs = return (pfold f xs)
opiscanl, opiscanr :: (a -> a -> a) -> [a] -> IO [a]
opiscanl f xs = return (piscanl f xs)
opiscanr f xs = return (piscanr f xs)

4.3 Example: Maximum Segment Sum (MSS) 

The standard Maximum Segment Sum problem will be used to illustrate the 
programming style at the collective and individual levels. The problem is to find 
the largest possible sum of a sequence of contiguous numbers within a list of
numbers. One could derive a solution to the problem from first principles, using 
a sequence of standard transformations, but that is not the point of this paper. 

Readers interested in how MSS works are referred to |1I3| . The algorithm mss 
can be expressed at the collective level, with monadic coordination, as follows: 

mss :: [Int] -> IO ()
mss xs =
  do ss <- opiscanl (+) xs
     ms <- opiscanr max ss
     let bs = [m-s+x | (m,s,x) <- zip3 ms ss xs]
     r <- opfold max bs
     putStr ("result = " ++ show r ++ "\n")
     return ()

For example, running 

test1 = mss [2,3,-50,20,30,-100,20,-19,21,22,23,-100,60]

produces 67, since the largest segment sum is 20 - 19 + 21 + 22 + 23 = 67: extending
the segment either to the left or right will make the result smaller, because of 
the big negative numbers, but it’s worth including the -19 in order to include 
also the 20. The intermediate lists computed for this example are: 

ss = [2,5,-45,-25,5,-95,-75,-94,-73,-51,-28,-128,-68] 
ms = [5, 5, 5, 5, 5, -28, -28, -28, -28, -28, -28, -68, -68] 
bs = [5,3,0,50,30,-33,67,47,66,45,23,-40,60] 
result = 67 
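
The intermediate lists can be reproduced with a pure version of the algorithm. This is a sketch for checking the example, not the paper's program: `mssPure` and `test1Input` are names introduced here, and the Prelude's inclusive scans `scanl1` and `scanr1` and `foldl1` are used in place of the collective operations.

```haskell
-- Pure counterpart of mss that also returns the intermediate lists.
-- The input list is assumed to be non-empty.
mssPure :: [Int] -> ([Int], [Int], [Int], Int)
mssPure xs = (ss, ms, bs, r)
  where
    ss = scanl1 (+) xs                             -- prefix sums
    ms = scanr1 max ss                             -- largest prefix sum at or after each position
    bs = [m - s + x | (m, s, x) <- zip3 ms ss xs]  -- best segment starting at each position
    r  = foldl1 max bs                             -- best segment overall

-- The input list of the example in the text.
test1Input :: [Int]
test1Input = [2, 3, -50, 20, 30, -100, 20, -19, 21, 22, 23, -100, 60]
```

Evaluating `mssPure test1Input` returns the lists ss, ms and bs shown above together with the result 67.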

4.4 Transformation 

We must now transform the collective level program mss into the individual level, 
where the collective communication operations are performed in three steps: 

1. The processor sends a request to the interconnection network, specifying 
which collective communication operation is to be performed, and supplying 
the data values contributed from the processor’s local memory. 

2. The interconnection network synchronises; that is, it waits until all proces- 
sors have made a request. It checks that all processors have requested the 
same operation; if not, a fatal error has occurred. Otherwise, the network
performs the operation, which in general involves both communication and 
computation. In the case of scan, for example, the network is responsible for 
arranging the necessary applications of f; it is immaterial where the actual
work is performed physically — the network could execute the applications, 
or it could have the processors do that work. 

3. Finally, the interconnection network packages the results of the operation 
into a set of replies, one for each processor. The processors receive their 
replies, which will normally contain data values, at which point they can 
resume their computations. 
The transformation to the individual level is straightforward. The type of
the program has changed from [Int] -> IO () to Int -> IO (), and each
collective communications operation is replaced by an operation with a similar
name which requests the operation. The local computation that was expressed
by a list comprehension (essentially, by a map) becomes just a local singleton
computation, expressed as let b = m-s+x.

mss_ind :: Int -> IO ()
mss_ind x =
  do s <- req_opiscanl Plus x
     m <- req_opiscanr Max s
     let b = m-s+x
     r <- req_opfold Max b
     putStr ("result = " ++ show r ++ "\n")
     return ()

A request to perform a collective communication operation contains a tag
specifying which operation is being performed, and all necessary arguments.
All of the collective communications requests are produced by a generic output
operation putReq, which takes the operands of the operation, packages them into
a data structure representing the request, and outputs it to the network. From
the viewpoint of the individual processor, putReq is simply an I/O operation.

req_opiscanl, req_opiscanr, req_opfold :: FunRep -> Int -> IO Int
req_opiscanl f x = putReq Req_opiscanl f x
req_opiscanr f x = putReq Req_opiscanr f x
req_opfold f x = putReq Req_opfold f x

A tag of type Request indicates which operation is to be performed. Natu- 
rally, the system supports a fixed set of operations, so it is convenient to represent 
them with an enumerated type: 

data Request
  = Req_opiscanr | Req_opiscanl | Req_opfold
  deriving Show

The functional argument f to the fold and scan functions must also be in- 
cluded in the request. Several different approaches have been taken on real par- 
allel systems for representing functional arguments. Some systems, such as the 
programming languages for the Connection Machine, restrict such functional 
arguments to a fixed set. This restriction is appropriate when the function ap- 
plications will be performed by dedicated hardware within the interconnection 
network, and that hardware is limited to a small fixed set of functions. In such 
systems, it is natural to represent the function by another
enumerated type:

data FunRep
  = Plus | Times | Max | Min
  deriving Show
However, this Draconian restriction prevents many useful programs from being
expressed. A better alternative is to provide actual executable code as the
function representation, but this raises several further problems: for instance,
some computers use special hardware in the interconnection network to perform
the function applications needed by fold and scan, and ordinary compiled
functions won’t run in the network nodes. Allowing an arbitrary function f to be
specified is more complicated to implement, but it greatly enhances the system’s
flexibility. In the remainder of this paper we take the simpler approach, where
functions must be represented by FunRep.
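
With this representation, whatever component performs the function applications must map each constructor back to the operation it stands for. A minimal sketch of such an interpreter follows; the name `applyFun` is an assumption introduced here, the restriction to Int matches the request type used in this section, and `Eq` is derived in addition to `Show` only for convenience.

```haskell
-- The fixed set of function representations, as in the text.
data FunRep = Plus | Times | Max | Min
  deriving (Show, Eq)

-- The binary operation denoted by each representation.
applyFun :: FunRep -> Int -> Int -> Int
applyFun Plus  = (+)
applyFun Times = (*)
applyFun Max   = max
applyFun Min   = min
```

For example, `foldl1 (applyFun Max) [1, 5, 3]` evaluates to 5. All four operations are associative, as the parallel fold and scan require.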

In order to make the individual level program executable on an ordinary 
workstation, putReq can be defined simply to output a string to the standard 
output channel, and to return a value parsed from a string that is read in. The 
definition below can be used for testing the individual level program, but the 
parallel system will provide its own I/O channel for collective communication
requests, and will have its own binary format for representing the requests and 
responses. 



putReq :: Request -> FunRep -> Int -> IO Int
putReq r f x =
  do putStr (show r ++ " " ++ show f ++ " " ++ show x ++ "\n")
     y <- getInt   -- getInt (not shown here) reads the reply as an Int from standard input
     return y



Now we can outline the transformation $\xrightarrow{C}$ which changes a program from
the collective to the individual level. The transformation has to replace each
collective communication combinator operation with the request to perform that
operation. In doing so, the functional argument f is replaced by its
representation of type FunRep; this is performed by the F transformation.
Furthermore, the type of the program is changed, to reflect the aggregate input.
Parallel maps are replaced by singleton computations, and this must be done for
list comprehensions as well as direct applications of map. A sketch of the
definition of $\xrightarrow{C}$ is given below, but there are a number of details that
are omitted here.



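$$
\begin{array}{lcl}
\texttt{pc :: [a] -> IO ()} & \xrightarrow{C} & \texttt{pi :: a -> IO ()} \\
\texttt{let ys = map f xs} & \xrightarrow{C} & \texttt{let y = f x} \\
\texttt{ys <- opiscanl f xs} & \xrightarrow{C} & \texttt{y <- req\_opiscanl (}F[\![\texttt{f}]\!]\texttt{) x} \\
\texttt{ys <- opiscanr f xs} & \xrightarrow{C} & \texttt{y <- req\_opiscanr (}F[\![\texttt{f}]\!]\texttt{) x} \\
\texttt{ys <- opfold f xs} & \xrightarrow{C} & \texttt{y <- req\_opfold (}F[\![\texttt{f}]\!]\texttt{) x}
\end{array}
$$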

There is one subtle point to note about the parallel fold operation, opfold. 
Ordinarily, fold functions return a singleton result, while scan functions return 
a list. Data parallel programs usually follow the same convention, and there is 
a unique control processor that receives the singleton result of operations like 
fold. In SPMD programming, however, there is no special control processor, so 
there is no obvious unique place to send the result of a fold. One approach would 
be to choose one particular processor to be the recipient of the result (just as
one processor is chosen to receive the results of a gather operation). Here, we 
take a different approach: the fold operation produces a singleton result, which 
is broadcast by the network to every processor. Thus the fold and scan functions 
have the same type: both produce aggregate (e.g. list) results, and the only 
difference is that with fold, all the elements of the list are the same. 

4.5 Collective Semantics 

The individual level program can be expressed in Haskell, and it has a standard 
semantics provided by the Haskell compiler. This can be executed directly on the 
system hardware; thus we can transform a program from a high level collective 
specification all the way down to executable parallel code, using formally justi- 
fied transformations throughout. The final individual level program is definitely 
sequential, but instead of performing normal Input/Output operations to ordi- 
nary peripherals, it does its I/O to the interconnection network, “outputting” 
requests for collective communications, and “inputting” the results. From the 
point of view of the hardware, it is just an ordinary sequential imperative pro- 
gram. 

The nonstandard collective semantics must model the behaviours of all the 
processors, as well as the interconnection network. Recall that in Section 3 we
used a collective semantics that modelled the set of processors with the map 
function. Since there was no communication, nothing had to be done about 
the interconnection network. Now, however, the collective semantics needs to 
perform a full simulation of the parallel system in order to define the meaning 
of an individual level request for collective communications. 

The simulation can be done in several ways; one approach is sketched here. 
Consider how we can define the collective meaning of 

y <- req_opiscanl f x 

The simulation is defined monadically, as a collective operation that performs 
the following sequence of steps: 

1. Each processor is executed in turn, and its request operation is redefined so 
that it saves the data provided by the processor into an aggregate structure. 

2. Now that all the inputs to the interconnection network are known, the be- 
haviour of the network is simulated using the appropriate combinator (for 
example, piscanl). This will produce a list of responses. 

3. Each response in the list is made available to the corresponding processor, 
as the result of its operation. 
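
One possible way to realise these steps is sketched below. It departs from the paper in a way that should be stated plainly: instead of redefining an IO operation, the individual level program is written as a pure request/response script (the hypothetical type `Proc`, with constructors `Ask` and `Done`), so that the simulator can run all processors in lockstep. The names `applyFun`, `network`, `simulate` and `mssInd` are likewise assumptions of this sketch, and the fold result is broadcast to every processor as described in Section 4.4.

```haskell
data Request = Req_opiscanl | Req_opiscanr | Req_opfold
  deriving (Show, Eq)

data FunRep = Plus | Times | Max | Min
  deriving (Show, Eq)

-- An individual level program: either finished, or issuing a request and
-- waiting for the network's reply.
data Proc r
  = Done r
  | Ask Request FunRep Int (Int -> Proc r)

applyFun :: FunRep -> Int -> Int -> Int
applyFun Plus  = (+)
applyFun Times = (*)
applyFun Max   = max
applyFun Min   = min

-- Step 2: the behaviour of the network, given by the matching combinator.
network :: Request -> FunRep -> [Int] -> [Int]
network Req_opiscanl f xs = scanl1 (applyFun f) xs
network Req_opiscanr f xs = scanr1 (applyFun f) xs
network Req_opfold   f xs = replicate (length xs) (foldl1 (applyFun f) xs)

-- Collective meaning of a list of processors, one script per processor.
simulate :: [Proc r] -> Either String [r]
simulate procs
  | Just rs <- mapM done procs = Right rs
  | otherwise =
      case procs of
        Ask r f _ _ : _ -> do
          xs <- mapM (contribution r f) procs    -- step 1: collect the inputs
          let ys = network r f xs                -- step 2: simulate the network
          simulate (zipWith resume procs ys)     -- step 3: deliver the replies
        _ -> Left "some processors finished while others are still requesting"
  where
    done (Done r) = Just r
    done _        = Nothing
    contribution r f (Ask r' f' x _)
      | r == r' && f == f' = Right x
    contribution _ _ _ = Left "processors requested different operations"
    resume (Ask _ _ _ k) y = k y
    resume p             _ = p

-- The individual level MSS program written as a script.
mssInd :: Int -> Proc Int
mssInd x =
  Ask Req_opiscanl Plus x $ \s ->
  Ask Req_opiscanr Max  s $ \m ->
  let b = m - s + x in
  Ask Req_opfold   Max  b $ \r ->
  Done r
```

Applying `simulate` to `map mssInd xs`, where xs is the thirteen-element list of the MSS example, returns `Right` of a list in which every processor holds 67. A disagreement between processors about the requested operation is reported as a `Left` value, modelling the fatal error mentioned in Section 4.4.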

It is interesting to consider exactly where each part of the computation takes 
place. A program written at the collective level contains both the function defini- 
tions of the parallel combinators (although these may be gathered in a separate 
library) and the full algorithm, expressed in the coordination language. For
example, the complete program mss from Section 4.3 contains the definition of
piscanl and its use. (It makes no difference in principle if the combinator is
loaded from a standard library; it could be defined by the programmer, just like
any other function.)

When the program is expressed at the individual level, there is no defini- 
tion of the combinators. These are, in effect, magic operations provided by the 
interconnection network. The actual user code contains only the coordination al- 
gorithm; the combinators correspond to whatever combination of hardware and 
software the manufacturer used to implement the MPI library on its parallel 
system. 

Occasionally, manufacturers of parallel systems provide software tools that 
simulate the full system, but which run on ordinary workstations. This is in- 
tended to allow programmers to develop and debug their code on their own 
machines, saving expensive time on the parallel machine. The collective seman- 
tics is exactly like such a software tool; it allows the individual level program to 
be executed in Haskell, on an ordinary workstation, in order to find out what 
result the program will have on the parallel system. 

4.6 Soundness 

To support our transformational programming methodology, we need a sound- 
ness theorem for the monadic coordination language with the full complement 
of collective communication operations. A precise statement of the theorem, and 
its proof, will not be given here. 

A program $P^c$ written at the collective level (with normal scans and folds, for
example) has just one meaning $S[\![P^c]\!]$, delivered by the standard semantics $S$ and
implemented by the Haskell compiler. A program $P^i$ written at the individual
level has two distinct meanings:

— If we have compiled $P^i$ into C+MPI and are running it on a real parallel
system, the program treats collective communication operations like
Input/Output operations, with the interconnection network acting as a special
I/O port.

— The collective semantics gives the program $P^i$ a meaning in the functional
world by simulating the parallel system hardware, including all the
processors and the interconnection network.

5 Conclusion 

It is conventional to write SPMD programs at the individual level, but it is
clearer to write high level specifications more abstractly, at the collective level. 
In order to exploit transformational programming in the SPMD model, we need 
a special transformation rule that converts a collective level operation into a 
corresponding request at the individual level. In order to provide a formal justifi- 
cation for this collective-to-individual transformation, we provide a nonstandard 
collective semantics, and a soundness theorem. 
This paper has taken only the first step; much further work remains. The
next step is to formalise the transformation for a practical, complete SPMD 
programming language and to prove the soundness theorem. We could choose 
a reasonable subset of MPI as the set of collective communication operations; 
the approach proposed here would then make it possible to compile a high level 
combinator program almost all the way down to an executable C+MPI program, 
using formally justified transformations throughout the entire process. 